Abstract
Action quality assessment (AQA) aims to automatically judge a human action from a video of that action and assign it a performance score. Most existing AQA methods divide RGB videos into short clips, transform each clip into a higher-level representation using a Convolutional 3D (C3D) network, and aggregate these representations by averaging; the aggregated representation is then used to predict the score. We find that this averaging-based aggregation is insufficient to capture the relative importance of clip-level features. In this work, we propose a learning-based weighted-averaging technique, which we call Weight-Decider (WD), that yields better performance at little additional computational cost. We also experiment with ResNets for learning better representations for action quality assessment, and we assess how the depth and input clip size of the convolutional neural network affect the quality of score predictions. We achieve a new state-of-the-art Spearman's rank correlation of 0.9315 (an increase of 0.45%) on the MTL-AQA dataset using a 34-layer (2+1)D ResNet processing 32-frame clips, with WD aggregation.
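The core idea of weighted aggregation can be sketched as follows. This is an illustrative NumPy sketch, not the paper's exact WD module: it assumes a single hypothetical linear scorer (`w`, `b`) that assigns each clip an importance score, which a softmax normalizes into weights for a convex combination of clip features.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_aggregate(clip_feats, w, b=0.0):
    """Learned weighted averaging of clip-level features.

    clip_feats: (n_clips, d) array of per-clip features.
    w, b: parameters of a hypothetical one-layer importance scorer.
    Returns the aggregated feature (d,) and the clip weights (n_clips,).
    """
    scores = clip_feats @ w + b      # one importance score per clip
    alphas = softmax(scores)         # weights sum to 1 across clips
    return alphas @ clip_feats, alphas

rng = np.random.default_rng(0)
feats = rng.normal(size=(9, 16))     # e.g. 9 clips, 16-dim features
w = rng.normal(size=16)
agg, alphas = weighted_aggregate(feats, w)
```

Plain averaging is the special case where every clip receives weight 1/n; here the weights are produced from the features themselves, so the parameters of the scorer can be trained end-to-end with the score regressor.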
Notes
1. Weights available at: https://github.com/kenshohara/3D-ResNets-PyTorch.
2. Weights available at: https://github.com/moabitcoin/ig65m-pytorch.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Farabi, S., Himel, H., Gazzali, F., Hasan, M.B., Kabir, M.H., Farazi, M. (2022). Improving Action Quality Assessment Using Weighted Aggregation. In: Pinho, A.J., Georgieva, P., Teixeira, L.F., Sánchez, J.A. (eds) Pattern Recognition and Image Analysis. IbPRIA 2022. Lecture Notes in Computer Science, vol 13256. Springer, Cham. https://doi.org/10.1007/978-3-031-04881-4_46
DOI: https://doi.org/10.1007/978-3-031-04881-4_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04880-7
Online ISBN: 978-3-031-04881-4
eBook Packages: Computer Science (R0)