Transformer helps identify kiwifruit diseases in complex natural environments

https://doi.org/10.1016/j.compag.2022.107258

Highlights

  • The Transformer can effectively suppress the interference of complex background information on disease recognition.

  • The overlapping patch embedding improves the local continuity of the Vision Transformer's input image information.

  • We build ConvViT, a scalable and efficient identification model that identifies kiwifruit diseases in complex environments with fewer parameters and less computation.

  • ConvViT has good generalization and is a potential backbone network for feature extraction.

Abstract

The complex backgrounds of disease images and the low contrast between disease areas and their surroundings are easily confused, seriously affecting the robustness and accuracy of kiwifruit disease identification models. To address these problems, this paper proposes a disease identification model, ConvViT (Convolutional Neural Network and Vision Transformer), that combines a Vision Transformer with a Convolutional Neural Network to identify diseases by extracting effective features of kiwifruit disease spots. The proposed ConvViT comprises a convolutional structure and a Transformer structure: the convolutional structure extracts the local features of the image, while the Transformer structure captures the global features of the disease area, helping the CNN see better. Meanwhile, the paper designs model variants with different numbers of parameters and FLOPs (floating-point operations) to improve scalability; the variants are lightweight so that they can run on devices with different resource constraints. We achieved 98.78% identification accuracy on the self-built kiwifruit disease dataset, an improvement of up to 4.53% over ResNet, ViT, and ResMLP models of the same level, with more than a 10% reduction in parameters and FLOPs. Experimental results on the PlantVillage and AI Challenger 2018 datasets further show that ConvViT generalizes well, indicating that the proposed model can solve kiwifruit disease identification problems in complex environments and serve as a valuable backbone network for other identification tasks with practical applications.

Introduction

Diseases seriously threaten the development of the kiwifruit industry, reducing both yield and quality. Over the last ten years, the global kiwifruit cultivation area and production have grown by 71.25% and 55.58%, respectively, and kiwifruit has become one of the world's mainstream consumer fruits. Accurate and timely disease diagnosis is therefore important, especially under the changing weather, variable lighting, and field backgrounds typical of kiwifruit disease identification. In general, farmers diagnose diseases based on their farming experience or on the guidance of agricultural experts. As the scale of the industry has increased, it has become clear that these methods no longer meet practical needs. When a disease occurs in a crop, the leaf surface shows symptoms such as changes in shape, color, and texture; these symptoms carry characteristic information and suggest a new way of automatically identifying kiwifruit diseases with machine vision techniques. Traditional machine vision techniques require manual extraction of disease features such as color, shape, and texture, followed by feature reduction and classification with traditional artificial intelligence algorithms such as PCA, genetic algorithms, SVM, and KNN. Such methods have been applied to wheat (Khairnar et al., 2014), maize (Zhang et al., 2014), cucumber (Zhang et al., 2017), etc. However, they rely on relatively small datasets, and manually extracted features are not optimal for disease identification, resulting in limited identification accuracy (90%–95%), weak universality, and poor migration performance.

In recent years, Convolutional Neural Networks (CNNs) have also been widely used for crop disease identification. CNNs have a more flexible topology and better feature representation, and are therefore widely used in image classification, object detection, and semantic segmentation tasks. Deep convolutional neural networks (DCNNs), formed by stacking additional convolutional and pooling layers, further enhance the representation capability of the network. Compared with traditional methods, DCNNs offer better robustness and higher identification accuracy. For example, improved DCNNs have been applied to disease identification in apple (Liu et al., 2017), banana (Amara et al., 2017), tomato (Brahimi et al., 2017), kiwifruit (Liu et al., 2020), and other crops, all achieving good results, but the datasets used were relatively small; as dataset size increases, the identification accuracy of CNN models decreases. Moreover, the image backgrounds in many of these datasets are relatively simple: although near-ideal results are reported, performance in actual application scenarios is relatively poor. Images taken in real scenarios have varying degrees of background complexity, such as weather changes, light intensity, and other objects on the ground, and these non-disease features may negatively affect the diagnosis results. Thus, many studies have used segmentation-based methods to separate leaf or disease-spot information from the background (Zhang et al., 2019a, Zhang et al., 2019b, Karlekar and Seal, 2020, Xiong et al., 2020). However, the usability of these disease identification models across different complex backgrounds is not strong. This led us to ask: "Can we start from the characteristics of the network model itself to solve the problem of crop disease identification in complex backgrounds?"

It is well known that the convolutional structure involves only local connections: CNNs obtain salient features by sliding a large number of convolutional kernels over the input image and use them as the basis for disease diagnosis. However, the local nature of the convolutional structure makes it inevitable that the acquired features include background features unrelated to the disease. Intuitively, to obtain better identification results, the model should pay more attention to the disease region and less to other features, so many works have incorporated attention-based structures into CNN models (Zeng et al., 2020). However, these works offer no uniform guidance on where to add attention mechanisms or how much attention structure to add, which can cause problems: adding attention in the initial stage of the model can make it more computationally intensive and overly focused on local features, failing to achieve the expected results, while adding only a single attention mechanism may lead to overfitting. The forms and principles of the attention mechanisms used in these works are different and specialized. By contrast, the Transformer's attention mechanism (self-attention) has a universal form. The property differences between the convolutional architecture, the Transformer, and our ConvViT are shown in Table 1.
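To make this universal form concrete, the following minimal NumPy sketch implements single-head scaled dot-product self-attention with illustrative random weight matrices (the sizes and weights are assumptions for demonstration, not the paper's configuration). Because every token attends to every other token, an attention row can link a disease-spot patch to distant context regardless of position:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (n_tokens, d_model); Wq/Wk/Wv project tokens to queries, keys, values.
    The (n_tokens x n_tokens) weight matrix gives every token a global view.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_tokens, n_tokens)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape, np.allclose(w.sum(axis=-1), 1.0))  # (4, 8) True
```

The same form works for any token sequence, which is why self-attention can be dropped into different architectures without task-specific redesign.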

The Transformer structure, a standard paradigm in natural language processing (Devlin et al., 2018), has recently made a splash in computer vision, as it can model the global dependencies of images at the pixel level (Dosovitskiy et al., 2020). Touvron et al. (2021a) introduced a knowledge distillation scheme to propose a data-efficient Transformer.

Yuan et al. (2021) progressively structured images into tokens by recursively aggregating neighboring tokens. Han et al. (2021) proposed inner and outer Transformers to model the relationship between token embeddings and sentence embeddings. Wang et al., 2021, Liu et al., 2021a, Liu et al., 2021b concurrently proposed hierarchical structures suited to dense prediction tasks such as object detection, semantic segmentation, and instance segmentation. It can be said that the Transformer has surpassed CNNs to some extent. The Transformer is based on a multi-head attention mechanism with a global perceptual field that lets information flow freely between different locations of the image, establishing long-range dependencies that give more attention to kiwifruit disease features while ignoring environmental factors. This property is naturally suited to kiwifruit disease identification in complex environments. However, ViT itself has some problems. For example, ViT inherits the Transformer's practice from natural language processing of dividing the input image into patches and flattening them into one-dimensional features. Unlike natural language, the semantic information of an image is at the pixel level, and dividing the image into patches makes it lose the continuity of local information. Besides, the multi-head attention (MHA) structure makes Transformer-based models larger in parameters and more complex in computation. These defects of ViT are exactly the strengths of CNN models; if the two can be combined, ViT's effectiveness in practical applications can be improved.
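The patch-splitting issue can be demonstrated with a small NumPy sketch that contrasts ViT's non-overlapping split (stride equal to patch size) with an overlapping split (stride smaller than patch size); the image and patch sizes here are illustrative assumptions, not the paper's configuration. With overlap, adjacent tokens share pixels, so local continuity at patch borders is preserved:

```python
import numpy as np

def count_patches(img_size, patch_size, stride):
    """Patches per side when a square image is tiled by a sliding window."""
    return (img_size - patch_size) // stride + 1

def extract_patches(img, patch_size, stride):
    """Flatten each (patch_size x patch_size) window into one token vector.

    stride == patch_size reproduces ViT's non-overlapping split;
    stride < patch_size yields overlapping patches whose shared pixels
    carry information across patch boundaries.
    """
    n = count_patches(img.shape[0], patch_size, stride)
    tokens = [
        img[i*stride:i*stride+patch_size, j*stride:j*stride+patch_size].ravel()
        for i in range(n) for j in range(n)
    ]
    return np.stack(tokens)

img = np.arange(8 * 8, dtype=float).reshape(8, 8)
vit_tokens = extract_patches(img, patch_size=4, stride=4)  # 2x2 = 4 tokens
ovl_tokens = extract_patches(img, patch_size=4, stride=2)  # 3x3 = 9 tokens
print(vit_tokens.shape, ovl_tokens.shape)  # (4, 16) (9, 16)
```

In practice such an overlapping embedding is usually realized as a strided convolution, which is one reason combining convolutional and Transformer structures is natural.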

In this paper, we propose ConvViT, a scalable and effective kiwifruit disease identification model for complex environments that combines the Transformer structure with the convolutional structure, and we demonstrate its effectiveness and the reasons behind it from three aspects: theoretical derivation, experiments, and visualization. The design of ConvViT draws on existing work, ViT (Dosovitskiy et al., 2020) and MLP-Mixer (Tolstikhin et al., 2021), as well as practical experience, so that the Transformer structure and the convolutional structure are combined more rationally to improve kiwifruit disease identification in complex environments. To enhance the usability of our model, we made a series of improvements to the original ViT that effectively reduce parameters and FLOPs. We provide three model variants to support deployment on devices with different resource constraints. Experimental results on the PlantVillage dataset (Hughes et al., 2015) and the AI Challenger 2018 dataset further validate the effectiveness of ConvViT.

The core contributions of this paper can be summarized as follows:

  • 1.

    We propose to solve the problem of kiwifruit disease identification in complex environments by using Transformer structure. Based on the self-attention mechanism, the Transformer can guide the convolution structure to focus on the features that are effective for identification results, making the CNN see better.

  • 2.

    We improve the patch embedding method of the original ViT model and reduce its computational complexity. The improved overlapping patch embedding approach facilitates the information exchange between adjacent patches, preserves the information of image edges, and ensures the continuity of image local information. The improved multi-head attention layer possesses linear computational complexity and reduces the parameters and FLOPs of the model.

  • 3.

    We propose a scalable, effective, and efficient kiwifruit disease identification model for complex backgrounds. ConvViT combines the strengths of the convolutional structure and the Transformer, achieving up to a 4.53% improvement in identification accuracy on both the self-built kiwifruit disease dataset and two public datasets, with a more than 10% reduction in parameters and FLOPs compared to popular models.
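The linear-complexity attention mentioned in contribution 2 can be illustrated by spatially reducing the key/value sequence before attention, a common scheme popularized by pyramid-style vision Transformers; this sketch is an assumption about the general technique, and the paper's exact mechanism may differ. Averaging groups of r tokens shrinks the attention matrix from (n x n) to (n x n/r):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sr_attention(X, Wq, Wk, Wv, r):
    """Attention with spatial reduction of keys and values.

    Groups of r consecutive tokens are averaged before projecting K and V,
    so the attention matrix is (n x n/r) instead of (n x n). This is one
    common way to cut MHA cost; it is illustrative, not the paper's code.
    """
    n, d = X.shape
    Xr = X.reshape(n // r, r, d).mean(axis=1)             # (n/r, d)
    Q, K, V = X @ Wq, Xr @ Wk, Xr @ Wv
    w = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (n, n/r)
    return w @ V

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 8))                              # 16 tokens
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = sr_attention(X, Wq, Wk, Wv, r=4)
print(out.shape)  # (16, 8)
```

With r=4 the score matrix holds 16 x 4 entries rather than 16 x 16, and the saving grows with sequence length, which is what makes such schemes attractive on resource-constrained devices.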

Section snippets

Self-build datasets

The kiwifruit disease dataset used in this paper was collected with a BM-500 GE / BB-500 GE digital camera in 2020 at the Kiwifruit Experiment Station of Northwest A&F University, Shaanxi Province, China, for a total of 2115 images. The dataset includes six leaf diseases, such as brown leaf spot, mosaic, and anthracnose, chosen for their obvious external characteristics. Fig. 1 shows some example images.

The dataset was randomly divided into training and test sets in this study in the ratio of

Experiments on the real-world kiwifruit leaf disease dataset

This paper selects three mainstream models based on different architectures for comparison. The ResNet network (He et al., 2016), based on the convolutional structure, serves as the baseline and has achieved remarkable success in many identification tasks; ViT (Dosovitskiy et al., 2020), based on the Transformer structure, introduced the Transformer into computer vision for the first time and achieved competitive results; and ResMLP (Touvron et al., 2021b), based on the MLP structure, MLP has a simple

Discussion

In this paper, a Transformer-based CNN model is constructed to solve the kiwifruit disease identification problem in complex backgrounds. The proposed model possesses higher accuracy and robustness in complex backgrounds. Experimental results on a homemade complex background kiwifruit disease dataset show that the proposed ConvViT model improves recognition accuracy by 4.53% (98.78%) and reduces the number of parameters and FLOPs by more than 10% compared with mainstream Resnet networks, ViT

Conclusions

This paper finds that the Transformer-based CNNs model can significantly improve kiwifruit disease recognition in complex backgrounds. The Transformer structure and the convolutional structure are two different architectures with very different properties. However, in practice, we have found that these two structures can be combined reasonably to improve the effectiveness of real disease recognition tasks. The model proposed in this paper is a more general form of combining the self-attention

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Key Research and Development Program of China (Grants No. 2020YFD1100600 and No. 2020YFD1100601). The authors appreciate the funding organization for its financial support.

References (32)

  • P. Zhang et al.

    EfficientNet-B4-Ranger: A novel method for greenhouse cucumber disease recognition under natural complex environment

    Comput. Electron. Agric.

    (2020)
  • W. Zeng et al.

    Crop leaf disease recognition based on Self-Attention convolutional neural network

    Comput. Electron. Agric.

    (2020)
  • Amara, J., Bouaziz, B., Algergawy, A., 2017. A deep learning-based approach for banana leaf diseases classification....
  • M. Brahimi et al.

    Deep learning for tomato diseases: classification and symptoms visualization

    Applied Artificial Intelligence

    (2017)
  • Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for...
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... Houlsby, N., 2020. An image...