
Self-supervised Representation Learning for Fine Grained Human Hand Action Recognition in Industrial Assembly Lines

  • Conference paper
  • In: Advances in Visual Computing (ISVC 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14361)


Abstract

Humans remain indispensable on industrial assembly lines, but in the event of an error they need support from intelligent systems. In addition to the objects being handled, it is equally important to understand a worker's fine-grained hand movements in order to track the entire process. Deep learning based hand action recognition methods, however, are highly label intensive, and not every industrial company can afford the associated annotation costs. This work therefore presents a self-supervised learning approach for industrial assembly processes in which a spatio-temporal transformer architecture is pre-trained on a wide variety of real-world video footage of daily life. This model is subsequently adapted to the industrial assembly task at hand using only a few labels. It is shown which established real-world datasets are best suited for learning representations of these hand actions through a regression pretext task, and to what extent this pre-training improves the subsequent supervised classification task.
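
The abstract outlines a two-stage pattern: self-supervised pre-training of a spatio-temporal transformer on a regression pretext task over unlabeled video, followed by supervised fine-tuning of a classifier with few labels. The sketch below illustrates that general pattern in PyTorch. It is not the authors' implementation: the choice of masked keypoint reconstruction as the regression task, the 21-keypoints-per-frame hand input, and every module, size, and name (SpatioTemporalEncoder, pretrain_step, finetune_step) are assumptions made purely for illustration.

```python
# Minimal sketch of the pre-train-then-fine-tune pattern described in the
# abstract. Assumptions (not taken from the paper): the pretext regression
# task is masked reconstruction of hand-keypoint sequences, each frame is
# 21 2D keypoints (42 values), and all sizes and names are illustrative.
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Transformer encoder over a sequence of per-frame keypoint embeddings."""
    def __init__(self, in_dim=42, d_model=128, n_heads=4, n_layers=4, max_len=64):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)  # per-frame "spatial" embedding
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned temporal positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                         # x: (B, T, 42)
        h = self.embed(x) + self.pos[:, : x.size(1)]
        return self.encoder(h)                    # (B, T, d_model)

def pretrain_step(encoder, head, x, mask_ratio=0.5):
    """Self-supervised regression: reconstruct masked frames with an MSE loss."""
    B, T, _ = x.shape
    mask = torch.rand(B, T, device=x.device) < mask_ratio  # True = masked frame
    x_masked = x.masked_fill(mask.unsqueeze(-1), 0.0)      # zero out masked frames
    recon = head(encoder(x_masked))                        # (B, T, 42)
    return ((recon - x) ** 2)[mask].mean()                 # loss only on masked frames

def finetune_step(encoder, classifier, x, labels):
    """Supervised fine-tuning with few labels: classify the pooled sequence."""
    feats = encoder(x).mean(dim=1)                         # temporal average pooling
    return nn.functional.cross_entropy(classifier(feats), labels)

encoder = SpatioTemporalEncoder()
recon_head = nn.Linear(128, 42)   # pretext head, discarded after pre-training
classifier = nn.Linear(128, 10)   # e.g. 10 assembly hand-action classes

x = torch.randn(8, 64, 42)        # dummy batch: 8 clips, 64 frames, 21 x/y keypoints
loss_ssl = pretrain_step(encoder, recon_head, x)
loss_ssl.backward()
# (in practice pre-training runs to convergence before fine-tuning starts)
loss_cls = finetune_step(encoder, classifier, x, torch.randint(0, 10, (8,)))
loss_cls.backward()
```

The point the abstract makes is that encoder weights learned from unlabeled everyday video transfer to the assembly domain, so the supervised stage only has to fit a small classification head, plus light encoder adaptation, from a handful of labels.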



Author information

Correspondence to Fabian Sturm.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sturm, F., Sathiyababu, R., Allipilli, H., Hergenroether, E., Siegel, M. (2023). Self-supervised Representation Learning for Fine Grained Human Hand Action Recognition in Industrial Assembly Lines. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2023. Lecture Notes in Computer Science, vol 14361. Springer, Cham. https://doi.org/10.1007/978-3-031-47969-4_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-47969-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47968-7

  • Online ISBN: 978-3-031-47969-4

  • eBook Packages: Computer Science, Computer Science (R0)
