Blind consumer video quality assessment with spatial-temporal perception and fusion

Multimedia Tools and Applications

Abstract

Blind quality assessment of user-generated content (UGC), or consumer, videos is a challenging problem in computer vision. Two issues remain open: how to effectively extract high-dimensional spatial-temporal features from consumer videos, and how to appropriately model the relationship between these features and user perception within a unified blind video quality assessment (BVQA) framework. To tackle these issues, we propose a novel BVQA model with spatial-temporal perception and fusion. First, we develop two perception modules that extract perceptual-distortion-related features separately from the spatial and temporal domains. In particular, the temporal-domain features are obtained by combining a 3D ConvNet with residual frames, which is highly efficient at capturing motion-specific temporal features. Second, we propose a feature fusion module that adaptively combines the spatial and temporal features. Finally, we map the fused features onto perceptual quality. Experimental results demonstrate that our model outperforms other advanced methods in predicting subjective video quality.
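
As a rough illustration of two ideas named in the abstract, the sketch below computes residual frames as differences between consecutive frames and combines a spatial and a temporal feature vector with a sigmoid gate. It is a minimal NumPy sketch, not the paper's implementation: the function names `residual_frames` and `gated_fusion`, the gate parameters `w` and `b`, and the gating form itself are assumptions made for illustration only.

```python
import numpy as np


def residual_frames(clip: np.ndarray) -> np.ndarray:
    """Return differences between consecutive frames.

    clip: array of shape (T, H, W, C) with T >= 2.
    Output has shape (T - 1, H, W, C).
    """
    if clip.ndim != 4 or clip.shape[0] < 2:
        raise ValueError("expected a (T, H, W, C) clip with at least 2 frames")
    # Cast to float first so that uint8 inputs do not wrap around on subtraction.
    clip = clip.astype(np.float32)
    return clip[1:] - clip[:-1]


def gated_fusion(spatial: np.ndarray, temporal: np.ndarray,
                 w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Hypothetical adaptive fusion of two D-dimensional feature vectors.

    A sigmoid gate is computed from the concatenated features and used as a
    per-dimension weight: gate * spatial + (1 - gate) * temporal.
    w has shape (2 * D, D) and b has shape (D,); in practice both would be learned.
    """
    z = np.concatenate([spatial, temporal]) @ w + b
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * spatial + (1.0 - gate) * temporal


# Toy usage: a 4-frame clip whose pixel values grow by 1 per frame.
ramp = np.arange(4, dtype=np.float32).reshape(4, 1, 1, 1)
clip = ramp * np.ones((4, 2, 2, 3), dtype=np.float32)
res = residual_frames(clip)

# With zero gate parameters the gate is 0.5 everywhere, so the two
# feature vectors are simply averaged.
fused = gated_fusion(np.array([1.0, 2.0]), np.array([3.0, 6.0]),
                     np.zeros((4, 2)), np.zeros(2))
```

In a full model the residual frames would be fed to a 3D ConvNet to produce the temporal features, and the gate parameters would be trained jointly with the quality regressor; neither step is shown here.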

Data Availability

Data availability is not applicable to this article.

The datasets generated and/or analysed during the current study are available in the following repositories:

- KoNViD-1k http://database.mmsp-kn.de/konvid-1k-database.html

- LIVE VQC https://live.ece.utexas.edu/research/LIVEVQC/index.html

- YouTube-UGC https://media.withyoutube.com/

The project corresponding to this manuscript is available through the link https://github.com/790578527/STFN.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grants 62072110, 61972097, and U21A20472, in part by the Major Science and Technology Project of Fujian Province (China) under Grant 2021HZ022007, in part by the Industry-Academy Cooperation Project under Grant 2021H6022, and in part by the Natural Science Foundation of Fujian Province under Grant 2020J01494.

Author information

Corresponding author

Correspondence to Yuzhen Niu.

Ethics declarations

Conflict of interest

The authors declare that they have a conflict of interest with all researchers at Fuzhou University, China.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Abbreviations List

Table 7 shows the abbreviation correspondence in the paper.

Table 7 Correspondence table of abbreviations used in the paper

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Niu, Y., Zheng, Y., Wang, Z. et al. Blind consumer video quality assessment with spatial-temporal perception and fusion. Multimed Tools Appl 83, 18969–18986 (2024). https://doi.org/10.1007/s11042-023-16242-8

  • DOI: https://doi.org/10.1007/s11042-023-16242-8
