ABSTRACT
Recent years have witnessed the boom of online video platforms. In this context, a graph illustrating the social relations among characters has long been expected, both to help audiences better understand the story and to support fine-grained video analysis at the semantic level. Unfortunately, although humans can easily infer the social relations among characters, it remains extremely challenging for intelligent systems to capture these relations automatically from multi-modal cues. Moreover, existing approaches fail to describe the relations among multiple characters from a graph-generation perspective. To that end, inspired by the human ability to infer social relationships, we propose a novel Hierarchical-Cumulative Graph Convolutional Network (HC-GCN) to generate the social relation graph for multiple characters in a video. Specifically, we first integrate short-term multi-modal cues, including visual, textual, and audio information, to generate frame-level graphs for subsets of the characters via a multimodal graph convolution technique. For the video-level aggregation task, we design an end-to-end framework that aggregates all frame-level subgraphs along the temporal trajectory, yielding a global video-level social graph with various social relationships among multiple characters. Extensive validation on two real-world large-scale datasets demonstrates the effectiveness of the proposed method compared with state-of-the-art (SOTA) baselines.
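To make the idea of cumulative aggregation concrete, the following is a minimal sketch, not the HC-GCN model itself: the abstract does not specify the aggregation rule, so this example assumes a plain score sum over frames in place of the learned end-to-end aggregation. The function name `aggregate_frame_graphs`, the dictionary layout, the character names, and the relation labels are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def aggregate_frame_graphs(frame_graphs):
    """Merge frame-level subgraphs into one video-level graph.

    Assumption (not from the paper): each frame-level graph maps a
    character pair to a dict of relation -> score, and aggregation is
    a simple sum of scores over time. HC-GCN learns this step instead.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for graph in frame_graphs:
        for (a, b), scores in graph.items():
            # Treat edges as undirected: (a, b) and (b, a) are the same pair.
            pair = tuple(sorted((a, b)))
            for relation, score in scores.items():
                totals[pair][relation] += score
    # Label each pair with its highest accumulated relation score.
    return {pair: max(rels, key=rels.get) for pair, rels in totals.items()}


# Hypothetical frame-level subgraphs; each covers only some of the characters.
frames = [
    {("Alice", "Bob"): {"friends": 0.6, "colleagues": 0.4}},
    {("Bob", "Alice"): {"friends": 0.3, "colleagues": 0.7},
     ("Bob", "Carol"): {"family": 0.9, "friends": 0.1}},
    {("Alice", "Bob"): {"friends": 0.8, "colleagues": 0.2}},
]

video_graph = aggregate_frame_graphs(frames)
for pair, relation in sorted(video_graph.items()):
    print(pair, relation)
```

In this toy input, a single frame favors "colleagues" for Alice and Bob, but the evidence accumulated across all frames favors "friends". This illustrates why a video-level graph built from many partial frame-level subgraphs can be more reliable than any individual frame.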
Index Terms
- Linking the Characters: Video-oriented Social Graph Generation via Hierarchical-cumulative GCN