DOI: 10.1145/3689093.3689181

MT-VQA: A Multi-task Approach for Quality Assessment of Short-form Videos

Published: 28 October 2024

Abstract

Short-form video, now a mainstream media format on video platforms, has undergone explosive growth in recent years. Vast numbers of short-form videos are produced, processed, and distributed to users every day, inevitably introducing quality degradation. Accurate video quality assessment (VQA) is therefore critical for monitoring and optimizing users' viewing experience. However, existing short-form VQA approaches neglect human attention patterns during video viewing, and progress in short-form VQA is further obstructed by the absence of large-scale datasets. To tackle these challenges, we first construct a large-scale short-form VQA dataset called SVQA. The SVQA dataset comprises diverse distortion types, covering the typical quality degradations that arise during the capture, encoding, and editing of short-form videos. For each short-form video in SVQA, we collect both a quality score and an eye-tracking annotation. Building on this dataset, we propose MT-VQA, a two-branch multi-task approach that jointly performs VQA and video saliency prediction (VSP) for short-form videos. We further propose a saliency fusion module that guides the VQA branch to focus on quality distortions within visually attractive regions. Extensive experiments show that our multi-task approach achieves superior performance on both the VQA and VSP tasks.
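As a rough illustration of the saliency fusion idea described above, the following PyTorch sketch re-weights VQA-branch features with a predicted saliency map so that quality distortions in visually attractive regions carry more weight. All names, tensor shapes, and the 1x1-convolution blending here are illustrative assumptions, not the authors' actual MT-VQA implementation.

import torch
import torch.nn as nn

class SaliencyFusion(nn.Module):
    # Hypothetical fusion module: re-weights quality features with a
    # predicted saliency map so the VQA branch attends to distortions
    # in visually attractive regions (illustrative, not the paper's code).
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv blends the saliency-weighted and original features
        self.blend = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, quality_feat: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # quality_feat: (B, C, H, W) features from the VQA branch
        # saliency:     (B, 1, h, w) map from the VSP branch, values in [0, 1]
        sal = nn.functional.interpolate(
            saliency, size=quality_feat.shape[-2:], mode="bilinear", align_corners=False)
        weighted = quality_feat * sal              # emphasize salient regions
        fused = torch.cat([weighted, quality_feat], dim=1)
        return self.blend(fused)                   # (B, C, H, W)

# Toy usage with random tensors standing in for real branch outputs.
if __name__ == "__main__":
    fusion = SaliencyFusion(channels=64)
    feat = torch.randn(2, 64, 28, 28)   # quality features
    sal = torch.rand(2, 1, 56, 56)      # predicted saliency map
    print(fusion(feat, sal).shape)      # torch.Size([2, 64, 28, 28])

Concatenating the weighted and unweighted features (rather than replacing them outright) is a common design choice here, since it avoids discarding distortions that fall outside salient regions.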


Information

Published In

QoEVMA'24: Proceedings of the 3rd Workshop on Quality of Experience in Visual Multimedia Applications
October 2024
63 pages
ISBN: 9798400712043
DOI: 10.1145/3689093
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. human attention
  2. short-form video
  3. video quality assessment

Qualifiers

  • Research-article

Funding Sources

  • NSFC
  • Beijing Natural Science Foundation

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

QoEVMA'24 paper acceptance rate: 6 of 6 submissions (100%)
Overall acceptance rate: 14 of 20 submissions (70%)
