
Exploring vision transformer: classifying electron-microscopy pollen images with transformer

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Pollen identification is a sub-discipline of palynology with broad applications in fields such as allergy control, paleoclimate reconstruction, criminal investigation, and petroleum exploration. Among these, pollen allergy is a common and frequently occurring disease worldwide. Accurate and rapid identification of pollen species under the electron microscope helps medical staff produce pollen forecasts and interrupt the natural course of pollen allergy. Current pollen species identification relies on professional researchers manually identifying pollen particles in images, a time-consuming and laborious process that cannot meet the requirements of pollen forecasting. Recently, the self-attention-based Transformer has attracted considerable attention in vision tasks such as image classification. However, pure self-attention lacks local operations on pixels and requires large-scale pretraining to achieve performance comparable to convolutional neural networks (CNNs). In this study, we propose a new Vision Transformer pipeline for image classification. First, we design a FeatureMap-to-Token (F2T) module to perform token embedding on the input image: global self-attention is applied to tokens that already carry local information, and the hierarchical design of CNNs is applied to the Vision Transformer, combining local and global strengths across multiscale spaces. Second, we use a distillation strategy to learn the feature representation in the output space of a teacher network, thereby absorbing the inductive bias of the CNN and improving recognition accuracy. Experiments demonstrate that, after being trained from scratch on the electron-microscopy pollen dataset, the proposed model achieves CNN-equivalent performance under the same conditions while requiring fewer model parameters and less training time. Code for the model is available at https://github.com/dkbshuai/PyTorch-Our-S.
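For readers who want a concrete starting point, the sketch below illustrates in PyTorch the two ideas summarized in the abstract: a convolutional FeatureMap-to-Token embedding that gives tokens local context before global self-attention, and a soft-label distillation loss that transfers a CNN teacher's output-space knowledge. This is a minimal illustration rather than the authors' released implementation (see the repository linked above); the layer choices, channel sizes, and the hyper-parameters alpha and tau are assumptions.

```python
# Minimal sketch of an F2T-style token embedding and a distillation loss.
# Not the released code; module names and hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class F2TEmbedding(nn.Module):
    """Embed an image into tokens via strided convolutions (assumed design)."""

    def __init__(self, in_chans=3, embed_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4), tokens carry local context
        return x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence for self-attention


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=3.0):
    """Cross-entropy on ground truth plus KL divergence to the CNN teacher's
    softened outputs; alpha and tau are assumed hyper-parameters."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kd


if __name__ == "__main__":
    tokens = F2TEmbedding()(torch.randn(2, 3, 224, 224))
    print(tokens.shape)  # torch.Size([2, 3136, 256])
```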


Data availability

 The pollen dataset generated and analysed during the current study is not publicly available because the study is ongoing, but is available from the corresponding author on reasonable request.


Acknowledgements

This study was supported by the National Natural Science Foundation of China (62066035, 61962044), the Natural Science Foundation of Inner Mongolia Autonomous Region (2022LHMS06004), and the Basic Scientific Research Business Fee Project of the Universities Directly Under the Inner Mongolia Autonomous Region (JY20220089).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Shi Bao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Duan, K., Bao, S., Liu, Z. et al. Exploring vision transformer: classifying electron-microscopy pollen images with transformer. Neural Comput & Applic 35, 735–748 (2023). https://doi.org/10.1007/s00521-022-07789-y

