
V\(^2\)MLP: an accurate and simple multi-view MLP network for fine-grained 3D shape recognition

Original article, published in The Visual Computer.

Abstract

Fine-grained 3D shape recognition (FGSR) is crucial for real-world applications. Existing methods face challenges in achieving high accuracy for FGSR due to high similarity within sub-categories and low dissimilarity between them, especially in the absence of part location or attribute annotations. In this paper, we propose V\(^2\)MLP, a multi-view representation-oriented MLP network dedicated to FGSR, using only class labels as supervision. V\(^2\)MLP comprises two key modules: the cross-view interaction MLP (CVI-MLP) and the cross-view fusion MLP (CVF-MLP). The CVI-MLP module captures contextual information, including local and global contexts through cross-view interactions, to extract discriminative view features that reinforce subtle differences between sub-categories. Meanwhile, the CVF-MLP module performs cross-view aggregation from space and view dimensions to obtain the final 3D shape features, minimizing information loss during the view feature fusion process. Extensive experiments on three categories from the FG3D dataset demonstrate the effectiveness of V\(^2\)MLP in learning discriminative features for 3D shapes, achieving state-of-the-art accuracy for FGSR. Additionally, V\(^2\)MLP performs competitively for meta-category recognition on the ModelNet40 dataset.
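The paper's exact layer definitions are not included in this excerpt. As an illustrative sketch only, the two-stage idea described in the abstract (cross-view interaction, then cross-view fusion) can be approximated in NumPy: the `mlp` helper, the weight shapes, the tanh activation, and the mean/max pooling fusion are all assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 12, 64  # number of rendered views, per-view feature dimension

def mlp(x, w1, w2):
    """Two-layer MLP (tanh activation chosen here for simplicity)."""
    return np.tanh(x @ w1) @ w2

# Per-view features, e.g. from a shared CNN backbone (random stand-ins here).
view_feats = rng.standard_normal((V, D))

# Cross-view interaction (CVI-MLP, illustrative): mix information ACROSS
# views by applying an MLP along the view axis, MLP-Mixer style, with a
# residual connection so each view feature retains its own content.
w1_v = rng.standard_normal((V, 2 * V)) * 0.1
w2_v = rng.standard_normal((2 * V, V)) * 0.1
interacted = view_feats + mlp(view_feats.T, w1_v, w2_v).T  # (V, D)

# Cross-view fusion (CVF-MLP, illustrative): aggregate the interacted view
# features into a single shape descriptor; combining mean and max pooling
# over the view axis is one simple way to reduce information loss.
fused = np.concatenate([interacted.mean(axis=0), interacted.max(axis=0)])

print(fused.shape)  # (128,) = 2 * D
```

In the actual V\(^2\)MLP architecture, both modules are learned MLPs trained end-to-end with only class labels as supervision; the pooling combination above merely illustrates why fusing along both the space and view dimensions can preserve more information than a single pooling operator.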



Data availability statement

The data that support the findings of this study are available from the corresponding author, Jing Bai, upon reasonable request.


Funding

This work was supported in part by the National Natural Science Foundation of China (62162001, 61762003), the Natural Science Foundation of Ningxia Province of China (2022AAC02041), and the Ningxia Excellent Talent Program.

Author information

Corresponding author

Correspondence to Jing Bai.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Consent for publication

This manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Visualization of the complete classification confusion matrix

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zheng, L., Bai, J., Bai, S. et al. V\(^2\)MLP: an accurate and simple multi-view MLP network for fine-grained 3D shape recognition. Vis Comput 40, 6655–6670 (2024). https://doi.org/10.1007/s00371-023-03191-4
