Class Agnostic, On-Device and Privacy Preserving Repetition Counting of Actions from Videos Using Similarity Bottleneck

  • Conference paper
  • First Online:
Computer Vision and Image Processing (CVIP 2022)

Abstract

We present a practical, privacy-preserving, on-device method for counting the repetitions of an action in a given video stream. Our approach computes the pairwise similarity between sampled frames of the video, using the per-frame features extracted by the feature extraction module and a suitable distance metric in the temporal self-similarity matrix (TSM) calculation module. The resulting TSM is passed to the count prediction module, which outputs the repetition count for the action in the video. The count prediction module is deliberately designed so that it never attends to the extracted per-frame features, which are video specific. This self-similarity bottleneck makes the model class agnostic and allows it to generalize to actions not observed during training. We train and evaluate on Countix, the largest available dataset for repetition counting, and also propose an effective way to augment the Countix training data. Our experiments show accuracies comparable to the state of the art with significantly smaller model footprints.
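The core idea in the abstract, forming a temporal self-similarity matrix from per-frame embeddings and handing only that matrix to the count predictor, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the choice of negative squared Euclidean distance, the row-wise softmax, and the `temperature` parameter are assumptions common to TSM-based counting pipelines, and the actual feature extractor and count prediction module are not shown.

```python
import numpy as np

def temporal_self_similarity(frame_features, temperature=1.0):
    """Build a (T, T) self-similarity matrix from (T, D) per-frame features.

    Similarity is the negative squared Euclidean distance between frame
    embeddings, scaled by a temperature and normalized with a row-wise
    softmax so each row is a distribution over all sampled frames.
    """
    # Pairwise squared distances via ||a||^2 - 2 a.b + ||b||^2.
    sq_norms = np.sum(frame_features ** 2, axis=1)
    dists = sq_norms[:, None] - 2.0 * frame_features @ frame_features.T + sq_norms[None, :]
    sims = -dists / temperature
    # Numerically stable row-wise softmax.
    sims = sims - sims.max(axis=1, keepdims=True)
    exp = np.exp(sims)
    return exp / exp.sum(axis=1, keepdims=True)
```

Because the downstream count predictor sees only this T-by-T matrix and never the D-dimensional features themselves, the same predictor can in principle handle any action class whose repetitions produce periodic structure in the matrix.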



Author information

Correspondence to Rishabh Khurana, Jayesh Rajkumar Vachhani or S Rakshith.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Khurana, R., Vachhani, J.R., Rakshith, S., Gothe, S.V. (2023). Class Agnostic, On-Device and Privacy Preserving Repetition Counting of Actions from Videos Using Similarity Bottleneck. In: Gupta, D., Bhurchandi, K., Murala, S., Raman, B., Kumar, S. (eds) Computer Vision and Image Processing. CVIP 2022. Communications in Computer and Information Science, vol 1777. Springer, Cham. https://doi.org/10.1007/978-3-031-31417-9_7

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-31417-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-31416-2

  • Online ISBN: 978-3-031-31417-9

  • eBook Packages: Computer Science, Computer Science (R0)
