DOI: 10.1145/3664647.3680570
research-article

CAPNet: Cartoon Animal Parsing with Spatial Learning and Structural Modeling

Published: 28 October 2024

Abstract

Cartoon animal parsing aims to segment the body parts of cartoon animals, such as heads, arms, legs and tails. Unlike previous parsing tasks, it faces new challenges, including irregular body structures, abstract drawing styles and diverse animal categories. Existing methods struggle with these challenges, which stem from the spatial and structural properties of cartoon animals. To address them, a novel spatial learning and structural modeling network, named CAPNet, is proposed for cartoon animal parsing. It targets three critical problems: spatial perception, structure modeling and spatial-structural consistency learning. A spatial-aware learning module integrates deformable convolutions to learn spatial features of diverse cartoon animals, and a multi-task edge and center point prediction mechanism is incorporated to capture intricate spatial patterns. A structural modeling method, which integrates a graph neural network with a shape-aware relation learning module, is proposed to model the complex structural representations of cartoon animals. To mitigate the significant differences among animals, a spatial and structural consistency learning strategy is proposed to capture and learn feature correlations across different animal species. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed approach, which outperforms state-of-the-art methods.
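
As an illustration of what edge and center point supervision could look like, the sketch below derives both kinds of auxiliary target from a part-label mask. The abstract does not describe how CAPNet builds these targets, so everything here is an assumption: the helper name `edge_and_center_targets`, the 4-neighbourhood edge rule, the use of per-part centroids as center points, and the toy labels are all hypothetical.

```python
from collections import defaultdict


def edge_and_center_targets(mask, background=0):
    """Derive auxiliary targets from a part-label mask (hypothetical helper).

    mask: list of rows, each a list of integer part labels.
    Returns (edges, centers):
      edges   - 0/1 grid marking pixels whose 4-neighbourhood contains
                a different label (an assumed edge definition);
      centers - dict mapping each non-background label to the
                (row, col) centroid of its pixels.
    """
    height, width = len(mask), len(mask[0])
    edges = [[0] * width for _ in range(height)]
    sums = defaultdict(lambda: [0, 0, 0])  # label -> [row_sum, col_sum, count]

    for r in range(height):
        for c in range(width):
            label = mask[r][c]
            # Mark the pixel as an edge if any in-bounds neighbour differs.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < height and 0 <= nc < width and mask[nr][nc] != label:
                    edges[r][c] = 1
                    break
            # Accumulate coordinates for the centroid of each body part.
            if label != background:
                acc = sums[label]
                acc[0] += r
                acc[1] += c
                acc[2] += 1

    centers = {label: (rs / n, cs / n) for label, (rs, cs, n) in sums.items()}
    return edges, centers


# Toy 4x4 mask: 0 = background, 1 = "head", 2 = "body" (illustrative labels).
toy_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 2, 2, 0],
]
edges, centers = edge_and_center_targets(toy_mask)
```

In a multi-task setup, maps like `edges` and points like `centers` would typically supervise auxiliary prediction heads alongside the main parsing head; whether CAPNet does this in the same way is not stated in the abstract.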

    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cartoon animal parsing
    2. deformable convolution
    3. graph neural network
    4. spatial learning
    5. structural modeling

    Conference

    MM '24
    MM '24: The 32nd ACM International Conference on Multimedia
    October 28 - November 1, 2024
    Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
