Elsevier

Neurocomputing

Volume 456, 7 October 2021, Pages 200-219

A coarse-to-fine capsule network for fine-grained image categorization

https://doi.org/10.1016/j.neucom.2021.05.032

Abstract

Fine-grained image categorization is challenging because the subordinate categories within an entry-level category can only be distinguished by subtle discriminations. This necessitates alternately localizing the key (most discriminative) regions and extracting domain-specific features. Existing methods predominantly perform these two steps independently, ignoring that representation learning and foreground localization can reinforce each other iteratively. Motivated by the state-of-the-art performance of capsule encoding for abstract semantic representation, we formalize our pipeline as a coarse-to-fine capsule network (CTF-CapsNet). It consists of customized expert CapsNets arranged at each perception scale and region proposal networks (RPNs) between adjacent scales. Their mutually motivated self-optimization achieves increasingly specialized cross-utilization of object-level and component-level descriptions. The RPN zooms into the most distinctive regions, taking the preceding information learned by the expert CapsNet as reference, while the finer-scale model takes as input an amplified attended patch from the previous scale. Overall, CTF-CapsNet is driven by three focal margin losses between label predictions and ground truth, and three regeneration losses between the original input images/feature maps and the reconstructed images. Experiments demonstrate that without any prior knowledge or strongly supervised supports (e.g., bounding-box/part annotations), CTF-CapsNet delivers competitive categorization performance among state-of-the-art methods, achieving testing accuracies of 89.57%, 88.63%, 90.51%, and 91.53% on our hand-crafted rice growth image set and three public benchmarks (CUB Birds, Stanford Dogs, and Stanford Cars), respectively.

Introduction

Generally, visual classification can be separated into three categories, i.e., subordinate-level, basic-level, and superordinate-level classification [1]. Relatively more progress has been made in basic-level categorization, e.g., categorizing vehicles, aircraft, birds, and flowers. More difficult than basic-level categorization, the less-explored subordinate-level classification (also known as fine-grained image classification) aims to distinguish targets belonging to homologous subordinate classes. Those classes derive directly from an entry-level class [1]; hence they are highly similar, and the discriminations between them can only be perceived by domain experts. Large intra-class variance and high inter-class similarity make fine-grained visual categorization more challenging than conventional pattern recognition. Despite this obstacle, fine-grained image classification technology benefits extensive scenarios, e.g., expert-level target detection [2], [3] and semantic image captioning [4], [5], [6].

By shaping robust visual descriptors, prior strategies [7], [8] have significantly lifted the accuracy of fine-grained image classification, e.g., pushing the accuracy on Caltech-UCSD Birds (CUB-200-2011) [9] from 17.31% [9] to 85.4% [7]. The main difficulty stems from the fact that discriminative cues are localized not just in the foreground area, but more meaningfully in subtler regions (e.g., the wings of a butterfly). Thereby, many studies follow this pipeline:

  1. Localizing hypothesized regions of interest by evaluating confidence feedback from models or by utilizing extra artificial supports, e.g., bounding-box/key-point annotations.

  2. Learning discriminative cues from domain-specific areas and encoding them into feature vectors for recognition.

They normally aim at shifting attention from object-level representations to local part-level representations. Object-level representations are commonly perceived via features extracted from the regions of interest given by hand-crafted or recommended bounding boxes/part annotations. By filtering out inconsequential background and other noise, and comprehensively depicting the whole foreground region, global-level representations achieve reasonably satisfying performance [10]. Localized component-level descriptions are motivated by the observation that the subtler discriminations among classes always reside in domain-specific regions with sufficient discriminability. A detailed review of related research is presented in Section 2.

Although significant progress has been made by the aforementioned methods, they still suffer from the following limitations.

  1. Many strategies [2], [5], [8], [11], [12], [13], [14], [15], [16], [17] are part-based, wherein interest components are either manually annotated or detected by trainable detectors. For the former, the annotations are reliable, but the procedure can be labor-intensive and time-consuming. For the latter, a set of component detectors is defined heuristically for a particular sample set, so their generalization is limited.

  2. Excessive reliance on strongly supervised hand-crafted supports [11], [13], e.g., bounding boxes and part annotations.

We notice that humans habitually perceive objects in an incremental fashion. For example, as depicted in Fig. 1 for vehicle recognition, we first understand that it is a car according to its outline (object-level features). Then, the exhaust cylinder, streamlined windshield, and tires (part-level features) reveal that the car comes from the sports car family. Finally, based on the letters on the license plate (finer-scale features), we can conclude that the car is a sports car affiliated with a specific company. We also observe that target detection and feature extraction are interconnected and can motivate each other: accurate region localization promotes the extraction of discriminative descriptions, which, in turn, helps localize representative regions more precisely. Conversely, poor localization and feature extraction will backfire. A similar increasingly specialized object detection manner can also be found in CrossNet [18].

Recently, deep learning algorithms represented by convolutional neural networks (CNNs) have shown potential for computer vision tasks [19], e.g., image classification [20], target detection [21], and semantic segmentation [22]. The capacity for representation reasoning and abstraction allows these techniques to provide a special perspective for visual semantic understanding [23], [24], [25] and fine-grained classification [11], [12], [13], [14], [15], [16], [17].

Although CNN-based strategies have made progress, they have two conspicuous flaws [26], [27]: (1) ignorance of spatial hierarchies between multiple entities; and (2) lack of rotational invariance. Thereby, many CNN-based studies on fine-grained categorization usually concentrate on subtler visual cues in the spectral and spatial domains; however, the spatial relevance and logical relations among tiny patches have seldom been taken into account. Such spatial relations may involve features, angles, sizes, locations, contexts, pyramids, or even ultrametrics based on spatially perceptive relationships.

To overcome the above shortcomings of CNNs, Sabour et al. [27] presented a novel encoding unit for neural networks, denoted the capsule network (CapsNet). Via dynamic routing by agreement (i.e., modifying or replacing the conventional backpropagation process in CNNs) and vectorizing the outputs (i.e., substituting for the scalar output), CapsNets embed features into capsule encoding units and connect neighboring layers. Moreover, the lower-level encoding units predict the outputs of the higher-level capsule vectors, and a higher-level capsule is activated only if those predictions pass the vote. Thanks to this mechanism, CapsNet can decide the optimal routes among capsules and the credit attribution between nodes in lower and higher layers, i.e., cluster the extracted features for each category. Intuitively, this property of CapsNet is well suited to discriminating the subtle visual distinctions in fine-grained image classification.
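The routing-by-agreement mechanism described above can be sketched as follows. This is a minimal NumPy illustration of the squash nonlinearity and routing iterations from Sabour et al. [27], not the customized routing agreement of CTF-CapsNet; the array shapes (32 lower capsules, 10 higher capsules of dimension 16) are illustrative choices:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Squash nonlinearity: short vectors shrink toward zero length,
    # long vectors approach unit length, and direction is preserved.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    # u_hat: predictions made by lower capsules for each higher capsule,
    # shape (n_lower, n_higher, dim_higher).
    n_lower, n_higher, _ = u_hat.shape
    b = np.zeros((n_lower, n_higher))  # routing logits
    for _ in range(n_iters):
        # Coupling coefficients: softmax over the higher capsules.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted vote per higher capsule
        v = squash(s)                           # higher-capsule outputs
        # Agreement: predictions aligned with v get a larger routing share.
        b += (u_hat * v[None]).sum(axis=-1)
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(32, 10, 16))
v = dynamic_routing(u_hat)
print(v.shape)  # (10, 16); each row's length encodes a class probability
```

The key design choice is that the vector length of each output capsule, bounded in (0, 1) by the squash function, serves directly as the activation probability of the corresponding entity.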

In this work, we capitalize on the achievability of these benefits and adopt CapsNet for fine-grained visual categorization. We formulate our progressive perception pipeline as a novel hierarchical model, denoted the coarse-to-fine capsule network (CTF-CapsNet), in which the most distinctive regions are localized via a region proposal network (RPN) (proposed in our previous research [19]) on the basis of the comprehensive feature maps learned by a modified five-layer expert CapsNet (a variant of the conventional CapsNet architecture [27]) arranged at each parallel perception scale.

Generally, CTF-CapsNet is a hierarchical model that takes input ranging from source images to finer-scale domain-specific regions at three perception levels. Its feature learning and patch proposal are conducted recursively in a coarse-to-fine fashion. The modified five-layer expert CapsNets and RPNs [19] constitute a self-optimizing closed loop, and finer-grained perceptions can continue to be stacked in a similar way. Eventually, its global features at shallow levels imply holistic cues, while the domain-specific features at relatively high levels describe subordinate-level variation. Without any prior knowledge or part annotations about the discriminative areas, it enables end-to-end model convergence for fine-grained image classification while requiring only class-level labels.
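The recursive attend-crop-amplify loop can be sketched as follows. This is only a data-flow illustration under stated assumptions: `expert_nets`, `rpns`, `crop_and_zoom`, and the stand-in networks in the demo are hypothetical names, and the real pipeline uses trained expert CapsNets and learned RPNs rather than these placeholders:

```python
import numpy as np

def crop_and_zoom(image, box, out_size):
    # Crop the attended box (x0, y0, x1, y1) and upsample it by
    # nearest-neighbour resampling, so the finer scale receives an
    # amplified view of the proposed region.
    x0, y0, x1, y1 = box
    patch = image[y0:y1, x0:x1]
    ys = np.linspace(0, patch.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, out_size).astype(int)
    return patch[np.ix_(ys, xs)]

def coarse_to_fine(image, expert_nets, rpns, out_size=224):
    # expert_nets[i](view) -> (logits, features); rpns[i](features) -> box.
    # Each scale classifies its current view; the RPN then proposes the
    # most discriminative sub-region to amplify for the next scale.
    logits_per_scale, view = [], image
    for expert, rpn in zip(expert_nets, rpns + [None]):
        logits, features = expert(view)
        logits_per_scale.append(logits)
        if rpn is not None:
            view = crop_and_zoom(view, rpn(features), out_size)
    return logits_per_scale

# Toy demo: stand-in "networks" that only demonstrate the data flow.
img = np.zeros((448, 448, 3))
dummy_expert = lambda v: (np.zeros(200), v)   # 200-way logits, identity features
dummy_rpn = lambda f: (f.shape[1] // 4, f.shape[0] // 4,
                       3 * f.shape[1] // 4, 3 * f.shape[0] // 4)
preds = coarse_to_fine(img, [dummy_expert] * 3, [dummy_rpn] * 2)
print(len(preds))  # 3: one prediction per perception scale
```

Note how three expert networks pair with only two RPNs: the finest scale classifies its view but proposes no further region, which mirrors the three-scale, two-RPN arrangement described above.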

CTF-CapsNet breaks through the conventional linear network topology and obeys an increasingly specialized perception when employing representations from multiple scales of the bottom-up representative paradigm. Since the perception of CTF-CapsNet follows this increasingly specialized trend, its focus can gradually shift to the most representative areas from coarse to fine (cf. Fig. 1, from outline to head, then to eyes and beak for a bird). Note that robust region-based feature learning can reinforce distinction localization, and vice versa. Thus, the whole network can reap precision benefits from the mutual motivation between patch proposal and feature extraction, without bells and whistles (e.g., iterative box refinement).

For performance evaluation, CTF-CapsNet is compared with state-of-the-art methods on three public benchmarks (i.e., CUB-200-2011 [9], Stanford Cars [28], and Stanford Dogs [29]) and our hand-crafted rice image set. We observe a consistent and satisfying improvement in performance in ablation verifications across modifications. The major contributions of this paper are summarized as follows:

  1. To the best of our knowledge, this paper represents the first attempt at building a coarse-to-fine CapsNet for fine-grained categorization. Comprehensive experiments reveal its superior performance over state-of-the-art methods.

  2. To boost performance, we propose a multi-scale expert CapsNet at each perception scale for feature learning and reasoning.

  3. We customize a novel dynamic routing agreement for model convergence.

  4. For effectiveness verification, 21,143 rice growth images spanning 38 plasma treatment schemas are collected as experimental samples.

The rest of the paper is structured as follows. Section 2 reviews prior research. Section 3 illustrates the methodology of the proposed CTF-CapsNet. Section 4 elaborates the experimental details, followed by an in-depth discussion in Section 5. Finally, Section 6 presents concluding remarks and future work.


Related works

In this section, we overview related work on fine-grained image classification and CapsNets.

Coarse-to-fine capsule network (CTF-CapsNet)

Sharing the state-of-the-art performance of CapsNet, we formulate our pipeline with three scales as a multi-dimensional CapsNet (cf. Fig. 2). It encodes hierarchical features in an increasingly specialized manner, and subtler perceptions can be stacked in the same way. The three scales are only used to illustrate the pipeline of our model; it is not limited to three or any definite number of scales. The number of perception scales can be changed adaptively according to

Experimental evaluation

In this section, we verify the effectiveness of CTF-CapsNet in fine-grained visual categorization against state-of-the-art baselines. A series of comprehensive experiments is conducted in customized training and testing phases on hand-crafted and publicly available benchmarks. See the following demonstrations for experimental details.

Discussion

The main argument of our work is that the proposed coarse-to-fine perception realized with the compatible expert CapsNets is an efficient pattern for weakly supervised fine-grained image classification. Experimentally, we first determine the most appropriate values of the downweighting factor λ of the overall loss function, the scale-down factor of the reconstruction loss, and the mask rate of the DropConnect layer through ablation verifications. On the basis of those fixed parameters, we continue with the following quantitative
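The two loss terms tuned above can be sketched as follows. This is a minimal NumPy sketch of the standard CapsNet margin loss and scaled reconstruction (regeneration) loss of Sabour et al. [27]; the paper's focal margin variant and its exact factor values may differ, and the numbers below are illustrative only:

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    # Standard CapsNet margin loss: each class capsule's length v_norms[k]
    # should exceed m_pos when class k is present and fall below m_neg
    # otherwise; lam downweights the absent-class term (the "downweighting
    # factor" role played by λ in the overall loss).
    present = targets * np.maximum(0.0, m_pos - v_norms) ** 2
    absent = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2
    return np.sum(present + absent)

def reconstruction_loss(x, x_rec, scale=5e-4):
    # Scaled sum-of-squares regeneration loss between the input and the
    # decoder's reconstruction, acting as a regularizer; the scale keeps
    # it from dominating the margin loss.
    return scale * np.sum((x - x_rec) ** 2)

v = np.array([0.95, 0.05, 0.2])  # capsule lengths for 3 classes
t = np.array([1.0, 0.0, 0.0])    # one-hot ground truth
loss = margin_loss(v, t)
print(round(loss, 6))            # 0.005: only the third capsule is penalized

rec = reconstruction_loss(np.ones((4, 4)), np.zeros((4, 4)))
print(rec)                       # 0.008 = 5e-4 * 16
```

In the worked example, the true-class capsule (length 0.95 > 0.9) and the second capsule (0.05 < 0.1) incur no penalty; only the third capsule's excess length over m_neg contributes.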

Conclusion and future works

To solve fine-grained image classification end-to-end without external strongly supervised supports, we proposed a coarse-to-fine CapsNet to shape an increasingly specialized description. It not only recursively implements feature learning and patch proposal, but also lets them reinforce each other to boost performance. Extensive experiments demonstrate our advantage in fine-grained recognition and attention localization on rice, vehicles, dogs, and birds, which can compete against

CRediT authorship contribution statement

Zhongqi Lin: Conceptualization, Methodology, Software, Validation, Formal analysis, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Jingdun Jia: Supervision, Project administration, Funding acquisition. Feng Huang: Supervision, Project administration, Funding acquisition. Wanlin Gao: Resources, Data curation, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Project of Scientific Operating Expenses, Ministry of Education of China, under Grant 2017PT19; in part by the National Natural Science Foundation of China under Grant No. 12075315; in part by the National Natural Science Foundation of China under Grant No. 11675261; in part by the National Natural Science Foundation for the Youth of China, Natural Science Foundation of Shandong Province, under Grant ZR2018QF002; and in part by the Provincial Project,


References (58)

  • Z. Lin et al.

    Fine-grained visual categorization of butterfly specimens at sub-species level via a convolutional neural network with skip-connections

    Neurocomputing

    (2020)
  • H. Yao, S. Zhang, Y. Zhang, J. Li and Q. Tian, Coarse-to-Fine Description for Fine-Grained Visual Categorization, in...
  • J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and F.F. Li, The unreasonable effectiveness...
  • T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng and Z. Zhang, The application of two-level attention models in deep...
  • H. L. Anne, V. Subhashini, R. Marcus, M. Raymond, S. Kate, and T. Darrell, Deep compositional captioning: Describing...
  • J. Johnson, A. Karpathy, and F. F. Li, Densecap: Fully convolutional localization networks for dense captioning, In...
  • S. Branson, G. Van Horn, S. Belongie, and P. Perona, Bird species categorization using pose normalized deep...
  • N. Zhang, J. Donahue, R. Girshick, and T. Darrell, Part-based R-CNNs for fine-grained category detection, in Computer...
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, The caltech-ucsd birds-200-2011 dataset, Comput. Neural...
  • Z. Lin et al.

    A hierarchical coarse-to-fine perception for small-target categorization of butterflies under complex backgrounds

    J. Intell. Fuzzy Syst.

    (2020)
  • S. Huang, Z. Xu, D. Tao and Y. Zhang, Part-stacked CNN for fine-grained visual categorization, in Proc. IEEE Conf....
  • Y. Peng et al.

    Object-part attention model for fine-grained image classification

    IEEE Trans. Image Process.

    (2018)
  • H. Zhang et al., SPDA-CNN: Unifying semantic part detection and abstraction for fine-grained recognition, in Proc. IEEE...
  • X.S. Wei, C.W. Xie, J.X. Wu, Mask-CNN: Localizing parts and selecting descriptors for fine-grained image recognition,...
  • C. Pang, H. Li, A. Cherian, H. Yao, Part-based fine-grained bird image retrieval respecting species correlation, in...
  • J. Fu, H. Zheng, T. Mei, Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained...
  • H. Zheng, J. Fu, T. Mei, J. Luo, Learning Multi-attention Convolutional Neural Network for Fine-Grained Image...
  • J. Leng, Y. Liu, Z. Wang, H. Hu and X. Gao, CrossNet: detecting objects as crosses, in IEEE Transactions on Multimedia,...
  • Z. Lin et al.

    Increasingly specialized perception network for fine-grained visual categorization of butterfly specimens

    IEEE Access

    (2019)
  • M. Biglari, A. Soleimani, H. Hassanpour, A Cascaded Part-Based System for Fine-Grained Vehicle Classification, IEEE....
  • J. Fang et al.

    Fine-grained vehicle model recognition using a coarse-to-fine convolutional neural network architecture

    IEEE. Trans. Intell. Transp. Syst.

    (2017)
  • L.-C. Chen et al.

    DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2018)
  • Z. Lin et al.

    A novel quadruple generative adversarial network for semi-supervised categorization of low-resolution image

    Neurocomputing

    (2020)
  • C. Szegedy, et al., Going deeper with convolutions, in: Proc. IEEE International Conference on Computer Vision and...
  • K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE...
  • E. Xi, S. Bing, Y. Jin, Capsule Network Performance on Complex Data, arXiv preprint, 2017; arXiv:1712.03480.Available:...
  • S. Sabour, N. Frosst, G.E. Hinton, Dynamic routing between capsules, In Proceedings of the Advances in Neural...
  • M. Liu, C. Yu, H. Ling, et al. Hierarchical joint cnn-based models for fine-grained cars recognition, International...
  • A. Khosla, N. Jayadevaprakash, B. Yao, & F. F. Li, Novel dataset for fine-grained image categorization: Stanford dogs,...

Zhongqi Lin is currently pursuing the Ph.D. degree with the College of Information and Electrical Engineering, China Agricultural University, Beijing, China. His current research interests include deep learning for target classification, deep transfer learning for object recognition and segmentation, deep reinforcement learning for prediction, machine learning, data-intensive parallel programming, and digital image processing techniques, including enhancement, compression, and denoising, with specific interests in fine-grained visual categorization. ORCID ID: http://orcid.org/0000-0002-3273-0783.

Jingdun Jia holds a doctorate and is a researcher at the China Rural Technology Development Center. He is also a committee member of the policy advisory board of the Australian Centre for International Agricultural Research and an adjunct professor at China Agricultural University. He has long been engaged in development strategy, planning, and policy for science and technology management, and also researches agricultural and rural development and regional development strategy. He has conducted in-depth research on rural scientific and technological innovation, agricultural biotechnology and the food industry, biological energy and the biomass industry, nutrition and health, and intelligent agricultural scientific and technological innovation. ORCID ID: https://orcid.org/0000-0001-9333-6934.

Wanlin Gao is the current Dean of the College of Information and Electrical Engineering of China Agricultural University. He is also a member of the Science and Technology Committee of the Ministry of Agriculture, a member of the Agriculture and Forestry Committee of Computer Basic Education in Colleges and Universities, a senior member of the Society of Chinese Agricultural Engineering, etc. He received his degrees (B.S., 1990; S.M., 2000; Ph.D., 2010) from China Agricultural University. His major research areas are the informationization of new rural areas, intelligent agriculture, and services for rural comprehensive information. He has been the principal investigator (PI) of over 20 national plans and projects. He has published 90 academic papers in domestic and foreign journals, over 40 of which are cited by SCI/EI/ISTP. He has written 2 teaching materials supported by the National Key Technology R&D Program of China during the 11th Five-Year Plan Period, and 5 monographs. Moreover, he owns 101 software copyrights, 11 patents for inventions, and 8 patents for new practical inventions. ORCID ID: https://orcid.org/0000-0002-4845-4541.

Feng Huang is a professor in the College of Science, China Agricultural University. She received her Ph.D. degree in June 2005 from the Institute of Physics, Chinese Academy of Sciences. Her research mainly focuses on experiments and computer simulations of plasma physics. ORCID ID: http://orcid.org/0000-0001-9866-6684.
