research-article

CAPNet: Cartoon Animal Parsing with Spatial Learning and Structural Modeling

Authors:

Wei Li

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

Pages 9809 - 9817

https://doi.org/10.1145/3664647.3680570

Published: 28 October 2024

Abstract

Cartoon animal parsing aims to segment the body parts of cartoon animals, such as heads, arms, legs, and tails. Unlike previous parsing tasks, it faces new challenges, including irregular body structures, abstract drawing styles, and diverse animal categories. These challenges arise from the spatial and structural properties of cartoon animals, and existing methods struggle to handle them. To address them, a novel spatial learning and structural modeling network, named CAPNet, is proposed for cartoon animal parsing. It targets three critical problems: spatial perception, structural modeling, and spatial-structural consistency learning. A spatial-aware learning module integrates deformable convolutions to learn the spatial features of diverse cartoon animals, and a multi-task mechanism that predicts edges and center points is incorporated to capture intricate spatial patterns. A structural modeling method, which integrates a graph neural network with a shape-aware relation learning module, is proposed to model the complex structural representations of cartoon animals. To mitigate the significant differences among animals, a spatial and structural consistency learning strategy is proposed to capture and learn feature correlations across different animal species. Extensive experiments on benchmark datasets demonstrate the effectiveness of the proposed approach, which outperforms state-of-the-art methods.
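The abstract states that the spatial-aware learning module builds on deformable convolutions but gives no layer details. As a rough illustration of the generic operator (Dai et al., 2017), and not of CAPNet's actual module, the sketch below computes one output value of a 3x3 deformable convolution in plain Python: each kernel tap is shifted by a fractional offset (dy, dx), and the feature map is read at the shifted position with bilinear interpolation. The function names and the single-channel, single-location setup are assumptions made for brevity; in a real network the offsets are predicted by an extra convolution applied to the same input.

```python
import math
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def bilinear(feature: Grid, y: float, x: float) -> float:
    """Bilinearly sample a 2-D feature map at fractional (y, x).

    Neighbours that fall outside the map contribute zero (zero padding).
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Weight shrinks linearly with distance to the integer grid point.
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value


def deform_conv_at(feature: Grid,
                   weights: Sequence[Sequence[float]],
                   offsets: Sequence[Tuple[float, float]],
                   py: int, px: int) -> float:
    """Output of a 3x3 deformable convolution centred at (py, px).

    offsets[k] = (dy, dx) is the (hypothetically learned) shift of the k-th
    kernel tap, listed in row-major order over the regular 3x3 grid.
    """
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            dy, dx = offsets[k]
            sample = bilinear(feature, py + ky + dy, px + kx + dx)
            out += weights[ky + 1][kx + 1] * sample
            k += 1
    return out


if __name__ == "__main__":
    # Toy 5x5 single-channel feature map whose value at (y, x) is 5*y + x.
    fmap = [[5.0 * y + x for x in range(5)] for y in range(5)]
    mean_kernel = [[1.0 / 9] * 3 for _ in range(3)]
    regular = deform_conv_at(fmap, mean_kernel, [(0.0, 0.0)] * 9, 2, 2)
    shifted = deform_conv_at(fmap, mean_kernel, [(0.0, 0.5)] * 9, 2, 2)
    print(regular, shifted)  # ~12.0 and ~12.5
```

With every offset set to (0, 0) the function reduces to an ordinary 3x3 convolution; non-zero offsets let the sampling grid bend toward irregular shapes such as stretched cartoon limbs. For actual training, PyTorch users would typically call `torchvision.ops.deform_conv2d` rather than a hand-written loop.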
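The abstract mentions a multi-task edge and center point prediction mechanism without saying how its supervision is built. One common recipe, assumed here rather than taken from the paper, derives both targets from the ground-truth part label map: a pixel counts as an edge if any 4-connected neighbour carries a different label, and each part's center is the centroid of its pixels. The helper names `edge_map` and `part_centers` and the toy label values are hypothetical.

```python
from typing import Dict, List, Tuple

LabelMap = List[List[int]]


def edge_map(labels: LabelMap) -> List[List[int]]:
    """Mark a pixel as edge (1) if any 4-connected neighbour has another label."""
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    edges[y][x] = 1
                    break  # one differing neighbour is enough
    return edges


def part_centers(labels: LabelMap, background: int = 0) -> Dict[int, Tuple[float, float]]:
    """Return the (row, col) centroid of every non-background part label."""
    sums: Dict[int, List[float]] = {}
    for y, row in enumerate(labels):
        for x, lab in enumerate(row):
            if lab == background:
                continue
            acc = sums.setdefault(lab, [0.0, 0.0, 0.0])  # [sum_y, sum_x, count]
            acc[0] += y
            acc[1] += x
            acc[2] += 1
    return {lab: (sy / n, sx / n) for lab, (sy, sx, n) in sums.items()}


if __name__ == "__main__":
    # Toy 4x4 map: 0 = background, 1 = head, 2 = tail (labels are illustrative).
    toy = [[0, 0, 0, 0],
           [0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 2, 2]]
    for row in edge_map(toy):
        print(row)
    print(part_centers(toy))  # {1: (1.5, 1.5), 2: (3.0, 2.5)}
```

This rule marks both sides of every boundary, so edges come out two pixels thick; implementations may thin or dilate them, and center targets are often rendered as Gaussian heatmaps rather than raw coordinates.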
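For the structural modeling component, the abstract only says that a graph neural network is combined with a shape-aware relation learning module. The sketch below is a generic illustration of the graph half alone: body parts are nodes, physically attached parts share an edge, and one round of mean-aggregation message passing mixes each part's feature with those of its neighbours. The graph `PART_GRAPH`, the fixed `self_weight` (standing in for learned weight matrices), and the ReLU update are assumptions; the shape-aware relation module is not modelled.

```python
from typing import Dict, List

Vector = List[float]

# Hypothetical part graph (an assumption, not CAPNet's graph): attached parts share an edge.
PART_GRAPH: Dict[str, List[str]] = {
    "head": ["torso"],
    "torso": ["head", "arm", "leg", "tail"],
    "arm": ["torso"],
    "leg": ["torso"],
    "tail": ["torso"],
}


def message_passing_step(features: Dict[str, Vector],
                         graph: Dict[str, List[str]],
                         self_weight: float = 0.5) -> Dict[str, Vector]:
    """One round of mean-aggregation message passing with a ReLU update.

    new_v = relu(self_weight * h_v + (1 - self_weight) * mean(h_u for u in N(v))).
    A node without neighbours keeps its own feature (up to the ReLU).
    """
    updated: Dict[str, Vector] = {}
    for node, own in features.items():
        neighbours = graph.get(node, [])
        if neighbours:
            # Element-wise mean of neighbouring part features.
            mean = [sum(features[n][i] for n in neighbours) / len(neighbours)
                    for i in range(len(own))]
        else:
            mean = list(own)
        updated[node] = [max(0.0, self_weight * h + (1 - self_weight) * m)
                         for h, m in zip(own, mean)]
    return updated


if __name__ == "__main__":
    # Two-dimensional toy features per part.
    feats = {
        "head": [1.0, 0.0],
        "torso": [0.0, 1.0],
        "arm": [0.0, 0.0],
        "leg": [0.0, 0.0],
        "tail": [0.0, 0.0],
    }
    print(message_passing_step(feats, PART_GRAPH))
```

Stacking several such steps lets information travel along the body, for example from head to tail through the torso, which is the kind of structural context a parsing network can draw on to resolve ambiguous parts.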


Index Terms

Computing methodologies → Artificial intelligence → Computer vision → Computer vision problems → Image segmentation


Published In


MM '24: Proceedings of the 32nd ACM International Conference on Multimedia

October 2024

11719 pages

ISBN:9798400706868

DOI:10.1145/3664647

General Chairs:
Jianfei Cai (Monash University, Australia)
Mohan Kankanhalli (NUS, Singapore)
Balakrishnan Prabhakaran (UT Dallas, USA)
Susanne Boll (University of Oldenburg, Germany)

Program Chairs:
Ramanathan Subramanian (University of Canberra & IIT Ropar, Australia)
Liang Zheng (Australian National University, Australia)
Vivek K. Singh (Rutgers University, USA)
Pablo Cesar (Centrum Wiskunde & Informatica, Netherlands)
Lexing Xie (Australian National University, Australia)
Dong Xu (University of Hong Kong, Hong Kong)

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States




Qualifiers

Research-article

Funding Sources

Key R&D Program of Guangxi Zhuang Autonomous Region, China
National Natural Science Foundation of China
Fundamental Research Funds for the Central Universities
China Postdoctoral Science Foundation
Sichuan Science and Technology Program

Conference

MM '24

Sponsor:

SIGMM

MM '24: The 32nd ACM International Conference on Multimedia

October 28 - November 1, 2024

Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

Total Citations: 0
Total Downloads: 32
Downloads (Last 12 months): 32
Downloads (Last 6 weeks): 7

Reflects downloads up to 16 Feb 2025
