Abstract
This paper focuses on Multi-level Hierarchical Classification (MLHC) of images, presenting a novel architecture that exploits the “[CLS]” (classification) token within transformers, a token often disregarded in computer vision tasks. Our primary goal is to exploit the information carried by every [CLS] token in a hierarchical manner. Toward this aim, we introduce the Multi-level Token Transformer (MLT-Trans). This model, trained with sharpness-aware minimization and a hierarchical loss function based on knowledge distillation, can be adapted to various transformer-based networks; we adopt the Swin Transformer as the backbone. Empirical results across diverse hierarchical datasets confirm the efficacy of our approach. The findings highlight the potential of combining transformers and [CLS] tokens, demonstrating improvements in hierarchical evaluation metrics and accuracy gains of up to 5.7% at the last level compared to the base network, thereby supporting the adoption of the MLT-Trans framework for MLHC.
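The idea sketched in the abstract, one [CLS]-style summary token per hierarchy level, each feeding its own classification head, with a knowledge-distillation-style term coupling the levels, can be illustrated in a few lines. The sketch below is a minimal NumPy toy, not the paper's actual model or loss: the level names (`coarse`, `fine`), the label counts, the child-to-parent map `parent`, and the temperature `T` are all hypothetical, and the coupling shown is a softened cross-entropy between the coarse head and the fine head's predictions rolled up to the coarse label space.

```python
import numpy as np

def softmax(x, t=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 16                                  # toy embedding dimension
levels = {"coarse": 3, "fine": 7}       # hypothetical label counts per level

# One [CLS]-style token per hierarchy level, each with its own linear head.
cls_tokens = {name: rng.normal(size=d) for name in levels}
heads = {name: rng.normal(size=(d, k)) * 0.1 for name, k in levels.items()}

logits = {name: cls_tokens[name] @ heads[name] for name in levels}
probs = {name: softmax(logits[name]) for name in levels}

# Hypothetical child-to-parent map: each of the 7 fine classes has one
# of the 3 coarse classes as its parent.
parent = np.array([0, 0, 1, 1, 1, 2, 2])
M = np.zeros((7, 3))
M[np.arange(7), parent] = 1.0

# Roll fine-level probabilities up to the coarse label space.
agg = probs["fine"] @ M

# KD-style hierarchical term (illustrative only): softened coarse
# predictions act as the teacher for the aggregated fine predictions.
T = 2.0
teacher = softmax(logits["coarse"], t=T)
kd_loss = -(teacher * np.log(agg + 1e-12)).sum()
```

In a full training loop this term would be added to the per-level classification losses, encouraging the fine head to stay consistent with the coarse head; the paper's actual loss and token routing may differ.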
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Boone Sifuentes, T., Nazari, A., Bouadjenek, M.R., Razzak, I. (2024). MLT-Trans: Multi-level Token Transformer for Hierarchical Image Classification. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science(), vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4
eBook Packages: Computer Science; Computer Science (R0)