Abstract
Fine-grained 3D shape recognition (FGSR) is crucial for real-world applications. Existing methods struggle to achieve high accuracy for FGSR because of the high similarity and low dissimilarity among sub-categories, especially in the absence of part location or attribute annotations. In this paper, we propose V\(^2\)MLP, a multi-view representation-oriented MLP network dedicated to FGSR that uses only class labels as supervision. V\(^2\)MLP comprises two key modules: the cross-view interaction MLP (CVI-MLP) and the cross-view fusion MLP (CVF-MLP). The CVI-MLP module captures contextual information, including local and global contexts, through cross-view interactions to extract discriminative view features that reinforce the subtle differences between sub-categories. The CVF-MLP module then aggregates the view features along the spatial and view dimensions to obtain the final 3D shape features, minimizing information loss during view feature fusion. Extensive experiments on the three categories of the FG3D dataset demonstrate the effectiveness of V\(^2\)MLP in learning discriminative features for 3D shapes, achieving state-of-the-art accuracy for FGSR. In addition, V\(^2\)MLP performs competitively for meta-category recognition on the ModelNet40 dataset.
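The abstract describes the pipeline only at a high level. As a rough illustration of the general multi-view MLP idea (not the authors' exact CVI-MLP/CVF-MLP design), the following PyTorch sketch shows one way to mix information across views with an MLP-Mixer-style block and then fuse the per-view features into a single shape descriptor. All module names, dimensions, and the learned-pooling fusion are assumptions made for illustration; spatial aggregation is omitted because the sketch starts from already-pooled per-view features.

```python
# Hypothetical sketch of a multi-view MLP pipeline (not the published V^2MLP code).
import torch
import torch.nn as nn


class CrossViewInteraction(nn.Module):
    """Assumed MLP-Mixer-style block: mix features across the view axis, then across channels."""

    def __init__(self, num_views: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.view_mlp = nn.Sequential(
            nn.Linear(num_views, hidden), nn.GELU(), nn.Linear(hidden, num_views)
        )
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_views, dim)
        y = self.norm(x).transpose(1, 2)           # (batch, dim, num_views): mix across views
        x = x + self.view_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm(x))     # mix across channels
        return x


class CrossViewFusion(nn.Module):
    """Assumed fusion: a learned softmax pooling over the view dimension."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.score(x), dim=1)     # (batch, num_views, 1) view weights
        return (w * x).sum(dim=1)                   # (batch, dim) shape descriptor


class MultiViewMLP(nn.Module):
    def __init__(self, num_views: int = 12, dim: int = 512, num_classes: int = 13):
        super().__init__()
        self.interaction = CrossViewInteraction(num_views, dim)
        self.fusion = CrossViewFusion(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: per-view features from a shared image backbone, (batch, num_views, dim)
        return self.head(self.fusion(self.interaction(view_feats)))


if __name__ == "__main__":
    model = MultiViewMLP()
    logits = model(torch.randn(2, 12, 512))
    print(logits.shape)  # torch.Size([2, 13])
```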







Data availability statement
The data that support the findings of this study are available from the corresponding author, Jing Bai, upon reasonable request.
Funding
This work was supported in part by the National Natural Science Foundation of China (62162001, 61762003), the Natural Science Foundation of Ningxia Province of China (2022AAC02041), and the Ningxia Excellent Talent Program.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Consent for publication
This manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
A.1 Visualization of the complete classification confusion matrix
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zheng, L., Bai, J., Bai, S. et al. V\(^2\)MLP: an accurate and simple multi-view MLP network for fine-grained 3D shape recognition. Vis Comput 40, 6655–6670 (2024). https://doi.org/10.1007/s00371-023-03191-4
DOI: https://doi.org/10.1007/s00371-023-03191-4