DOI: 10.1145/3511808.3557369
research-article

Joint Clothes Detection and Attribution Prediction via Anchor-free Framework with Decoupled Representation Transformer

Published: 17 October 2022

Abstract

Clothes attribution prediction is a key technology for automatically describing clothing characteristics. Most current methods first detect the clothes in an image, then crop each detected item and feed it to a separate network for attribution prediction. This two-stage approach is time- and resource-consuming. A one-stage approach, by contrast, can be both effective and efficient because it integrates clothes detection and attribution prediction into a single end-to-end framework. However, existing one-stage approaches tend to rely on anchor-based detectors, which are highly sensitive to hyperparameters and incur high computational complexity from dense anchors. They may also run into an optimization contradiction during training, because the clothes detection and attribution prediction branches place different demands on optimization. To address these problems, we develop an end-to-end anchor-free framework that adds an extra branch for joint clothes detection and attribution prediction. To handle the optimization contradiction between the two branches, we encode the backbone feature map as pixel-level dense queries and decode them with a deformable transformer; the output features are then fed into the detection and prediction branches, respectively. In this way, the features of the two branches are decoupled and the optimization contradiction is naturally resolved. To further improve prediction accuracy, we also develop, in the prediction branch, a special attention strategy and loss function that adaptively integrate peer attribution relationships into feature learning and avoid mutual suppression among hierarchical attributions. Extensive simulation results verify the effectiveness of the proposed work.
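
The abstract does not specify the form of the loss function, so the following is only a minimal sketch of one common way that "mutual suppression" between attribute levels can arise and be avoided. It assumes two hypothetical attribute groups (`category` and `sleeve`) and compares a single softmax over all attributes with a separate softmax per group. The group names, labels, and logits are invented for illustration and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attribute hierarchy (an assumption, not from the paper).
GROUPS = {
    "category": ["dress", "shirt", "trousers"],
    "sleeve": ["short", "long", "sleeveless"],
}

def flat_softmax(logits_by_group):
    """One softmax over all attributes: groups compete with each other."""
    flat = [x for g in GROUPS for x in logits_by_group[g]]
    probs = softmax(flat)
    out, i = {}, 0
    for g in GROUPS:
        n = len(GROUPS[g])
        out[g] = probs[i:i + n]
        i += n
    return out

def grouped_softmax(logits_by_group):
    """One softmax per group: attributes only compete within their group."""
    return {g: softmax(logits_by_group[g]) for g in GROUPS}

def grouped_cross_entropy(logits_by_group, target_index_by_group):
    """Sum of per-group cross-entropy terms for one garment."""
    probs = grouped_softmax(logits_by_group)
    return sum(-math.log(probs[g][target_index_by_group[g]]) for g in GROUPS)

# A garment that is confidently a "dress" with "long" sleeves.
logits = {"category": [4.0, 0.0, 0.0], "sleeve": [0.0, 4.0, 0.0]}
flat = flat_softmax(logits)
grouped = grouped_softmax(logits)
loss = grouped_cross_entropy(logits, {"category": 0, "sleeve": 1})

print("flat    P(dress) =", round(flat["category"][0], 3))
print("grouped P(dress) =", round(grouped["category"][0], 3))
print("grouped loss     =", round(loss, 3))
```

With a single softmax, the two correct attributes share one probability budget, so neither can approach 1 however confident the network is. With one softmax per group, each correct attribute can be scored independently of the other group.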

Supplementary Material

MP4 File (CIKM22-fp0457.mp4)
This video illustrates our work on clothes attribute prediction, a foundational task for the automatic description of clothing that plays an important role in clothing analysis, retrieval, recommendation, voice interaction, etc. We abandon the traditional two-stage approach, i.e., detection first, then cropping, and finally attribute prediction. Through a series of improvements, such as adopting the anchor-free paradigm, introducing feature decoupling, and adding an attention mechanism, we apply an end-to-end network to this task and achieve good results. The video shows the results of our quantitative and qualitative experiments.

Published In

CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge Management
October 2022
5274 pages
ISBN: 9781450392365
DOI: 10.1145/3511808
General Chairs: Mohammad Al Hasan, Li Xiong

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. anchor-free detectors
  2. clothes attribution prediction
  3. end-to-end learning
  4. fashion analysis

Qualifiers

  • Research-article

Conference

CIKM '22

Acceptance Rates

CIKM '22 Paper Acceptance Rate 621 of 2,257 submissions, 28%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Article Metrics

  • Downloads (Last 12 months): 29
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 27 Jan 2025

Cited By

  • (2024) Multiview Multilabel Classification With Group-Based Feature and Label Selection. IEEE Transactions on Consumer Electronics 70:1 (3308-3317). DOI: 10.1109/TCE.2023.3278457. Online publication date: Mar-2024
  • (2024) Text-Conditioned Outfit Recommendation With Hybrid Attention Layer. IEEE Access 12 (281-293). DOI: 10.1109/ACCESS.2023.3346933. Online publication date: 2024
  • (2024) YOLOPX: Anchor-free multi-task learning network for panoptic driving perception. Pattern Recognition 148 (110152). DOI: 10.1016/j.patcog.2023.110152. Online publication date: Apr-2024
  • (2023) FedDAD: Federated Domain Adaptation for Object Detection. IEEE Access 11 (51320-51330). DOI: 10.1109/ACCESS.2023.3279132. Online publication date: 2023
  • (2023) Improving fashion captioning via attribute-based alignment and multi-level language model. Applied Intelligence 53:24 (30803-30821). DOI: 10.1007/s10489-023-05167-2. Online publication date: 25-Nov-2023