Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition

Published: 12 July 2023

Abstract

With the rapid development of deep learning and multimedia technology, intelligent urban computing has attracted increasing attention from both academia and industry. Unfortunately, most of the related techniques are black-box paradigms that lack interpretability. Video event recognition is one of the fundamental technologies among them. An event contains multiple concepts and rich interactions among them, which can help us construct explainable event recognition methods. However, the crucial concepts needed to recognize an event exhibit diverse temporal patterns of occurrence, and the relationship between events and the temporal characteristics of concepts has not been fully exploited, which poses great challenges for concept-based event categorization. To address these issues, we introduce the temporal concept receptive field, i.e., the length of the temporal window required to capture the key concepts in concept-based event recognition. Accordingly, we propose temporal dynamic convolution (TDC), which models the temporal concept receptive field dynamically according to different events. Its core idea is to combine the outputs of multiple convolution layers using coefficients learned from two complementary perspectives. These convolution layers use a variety of kernel sizes and thus provide temporal concept receptive fields of different lengths. Similarly, we propose cross-domain temporal dynamic convolution (CrTDC), which further exploits the rich relationships between different concepts. The learned coefficients help us capture suitable temporal concept receptive field sizes and highlight crucial concepts, yielding accurate and complete concept representations for event analysis. Based on TDC and CrTDC, we build the temporal dynamic concept modeling network (TDCMN) for explainable video event recognition. We evaluate TDCMN on the large-scale and challenging FCVID, ActivityNet, and CCV datasets. Experimental results show that TDCMN significantly improves the event recognition performance of concept-based methods, and its explainability encourages the construction of more explainable models from the perspective of the temporal concept receptive field.
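
To make the mechanism concrete, below is a minimal sketch of the TDC idea in PyTorch. The abstract does not prescribe a framework, and the class and parameter names here (TemporalDynamicConv, kernel_sizes, and so on) are illustrative assumptions, not the authors' implementation: parallel temporal convolutions with different kernel sizes realize temporal concept receptive fields of different lengths, and input-dependent coefficients mix their outputs.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TemporalDynamicConv(nn.Module):
        """Illustrative sketch: dynamic mixing of temporal receptive fields."""
        def __init__(self, num_concepts, kernel_sizes=(1, 3, 5, 7)):
            super().__init__()
            # One temporal convolution branch per candidate receptive-field length.
            self.branches = nn.ModuleList([
                nn.Conv1d(num_concepts, num_concepts, k, padding=k // 2)
                for k in kernel_sizes
            ])
            # Gating head: predicts one mixing coefficient per branch from the
            # globally pooled concept scores of the input video.
            self.gate = nn.Linear(num_concepts, len(kernel_sizes))

        def forward(self, x):
            # x: (batch, num_concepts, num_segments), per-segment concept scores.
            coeff = F.softmax(self.gate(x.mean(dim=-1)), dim=-1)     # (B, branches)
            out = torch.stack([b(x) for b in self.branches], dim=1)  # (B, branches, C, T)
            # Weight each receptive-field branch by its coefficient and sum.
            return (coeff[:, :, None, None] * out).sum(dim=1)        # (B, C, T)

    # Usage: mix receptive fields of length 1/3/5/7 over 300 concepts, 32 segments.
    tdc = TemporalDynamicConv(num_concepts=300)
    scores = torch.randn(2, 300, 32)
    print(tdc(scores).shape)  # torch.Size([2, 300, 32])

The full TDCMN described in the paper learns such coefficients from two complementary perspectives and adds the cross-domain variant (CrTDC) over relationships between concepts; the sketch above only illustrates the basic receptive-field mixing principle.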

    Information

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 6
    November 2023, 858 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3599695
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 12 July 2023
    Online AM: 25 October 2022
    Accepted: 25 September 2022
    Revised: 10 July 2022
    Received: 28 February 2022
    Published in TOMM Volume 19, Issue 6

    Author Tags

    1. Event recognition
    2. temporal concept receptive field
    3. dynamic convolution

    Qualifiers

    • Research-article

    Funding Sources

    • Technology and Innovation Major Project of the Ministry of Science and Technology of China
    • National Natural Science Foundation of China
    • Beijing Nova Program
    • Fundamental Research Funds for the Central Universities

    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 4
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2025) Mixed Attention and Channel Shift Transformer for Efficient Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3712594. Online publication date: 17-Jan-2025.
    • (2024) Review on scene graph generation methods. Multiagent and Grid Systems 20, 2, 129-160. DOI: 10.3233/MGS-230132. Online publication date: 12-Aug-2024.
    • (2024) Cross-Attention Based Two-Branch Networks for Document Image Forgery Localization in the Metaverse. ACM Transactions on Multimedia Computing, Communications, and Applications 21, 2, 1-24. DOI: 10.1145/3686158. Online publication date: 30-Dec-2024.
    • (2024) Fair and Robust Federated Learning via Decentralized and Adaptive Aggregation based on Blockchain. ACM Transactions on Sensor Networks. DOI: 10.1145/3673656. Online publication date: 17-Jun-2024.
    • (2024) Push the Limit of Highly Accurate Ranging on Commercial UWB Devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2, 1-27. DOI: 10.1145/3659602. Online publication date: 15-May-2024.
    • (2024) xMeta: SSD-HDD-hybrid Optimization for Metadata Maintenance of Cloud-scale Object Storage. ACM Transactions on Architecture and Code Optimization 21, 2, 1-20. DOI: 10.1145/3652606. Online publication date: 21-May-2024.
    • (2024) Suitable and Style-Consistent Multi-Texture Recommendation for Cartoon Illustrations. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1-26. DOI: 10.1145/3652518. Online publication date: 16-May-2024.
    • (2024) MS-GDA: Improving Heterogeneous Recipe Representation via Multinomial Sampling Graph Data Augmentation. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 7, 1-23. DOI: 10.1145/3648620. Online publication date: 25-Apr-2024.
    • (2024) MSEConv: A Unified Warping Framework for Video Frame Interpolation. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3648364. Online publication date: 14-Feb-2024.
    • (2024) GMS-3DQA: Projection-Based Grid Mini-patch Sampling for 3D Model Quality Assessment. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 6, 1-19. DOI: 10.1145/3643817. Online publication date: 8-Mar-2024.
