Abstract
This paper focuses on Multi-level Hierarchical Classification (MLHC) of images, presenting a novel architecture that exploits the “[CLS]” (classification) token within transformers, a token often disregarded in computer vision tasks. Our primary goal is to exploit the information carried by every [CLS] token in a hierarchical manner. Toward this aim, we introduce the Multi-level Token Transformer (MLT-Trans). This model, trained with sharpness-aware minimization and a hierarchical loss function based on knowledge distillation, can be adapted to various transformer-based networks; we adopt the Swin Transformer as the backbone. Empirical results across diverse hierarchical datasets confirm the efficacy of our approach. The findings highlight the potential of combining transformers and [CLS] tokens, demonstrating improvements in hierarchical evaluation metrics and accuracy gains of up to 5.7% at the last level compared to the base network, thereby supporting the adoption of the MLT-Trans framework for MLHC.
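The idea sketched in the abstract, one [CLS]-style summary token per hierarchy level, each feeding its own classification head, with a knowledge-distillation-style term coupling the levels, can be illustrated in a few lines. The sketch below is a minimal NumPy toy, not the paper's actual model or loss: the level names (`coarse`, `fine`), the label counts, the child-to-parent map `parent`, and the temperature `T` are all hypothetical, and the coupling shown is a softened cross-entropy between the coarse head and the fine head's predictions rolled up to the coarse label space.

```python
import numpy as np

def softmax(x, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16                                  # toy embedding dimension
levels = {"coarse": 3, "fine": 7}       # hypothetical label counts per level

# One [CLS]-style token per hierarchy level, each with its own linear head.
cls_tokens = {name: rng.normal(size=d) for name in levels}
heads = {name: rng.normal(size=(d, k)) * 0.1 for name, k in levels.items()}

logits = {name: cls_tokens[name] @ heads[name] for name in levels}
probs = {name: softmax(logits[name]) for name in levels}

# Hypothetical child-to-parent map: each of the 7 fine classes has one
# of the 3 coarse classes as its parent.
parent = np.array([0, 0, 1, 1, 1, 2, 2])
M = np.zeros((7, 3))
M[np.arange(7), parent] = 1.0

# Roll fine-level probabilities up to the coarse label space.
agg = probs["fine"] @ M

# KD-style hierarchical term (illustrative only): softened coarse
# predictions act as the teacher for the aggregated fine predictions.
T = 2.0
teacher = softmax(logits["coarse"], t=T)
kd_loss = -(teacher * np.log(agg + 1e-12)).sum()
```

In a full training loop this term would be added to the per-level classification losses, encouraging the fine head to stay consistent with the coarse head; the paper's actual loss and token routing may differ.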
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Boone Sifuentes, T., Nazari, A., Bouadjenek, M.R., Razzak, I. (2024). MLT-Trans: Multi-level Token Transformer for Hierarchical Image Classification. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science(), vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4
eBook Packages: Computer Science; Computer Science (R0)