Test-Time Adaptation for Egocentric Action Recognition

Planamente, Mirco; Plizzari, Chiara; Caputo, Barbara

doi:10.1007/978-3-031-06433-3_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13233)

Included in the following conference series:

International Conference on Image Analysis and Processing

Abstract

Egocentric action recognition is becoming an increasingly researched topic thanks to the rising popularity of wearable cameras. Despite the numerous publications in the field, the learned representations still suffer from an intrinsic “environmental bias”. To address this issue, domain adaptation and generalization approaches have been proposed, which operate either by adapting the model to target data during training or by learning a model that generalizes to unseen videos by exploiting knowledge from multiple source domains. In this work, we propose to adapt a model trained on source data to novel environments at test time, making adaptation practical in real-world scenarios where target data are not available at training time. On the popular EPIC-Kitchens dataset, we present a new benchmark for Test-Time Adaptation (TTA) in egocentric action recognition. Moreover, we propose a new multi-modal TTA approach, which we call RNA\(^{++}\), and combine it with a new set of losses aimed at reducing the classifier’s uncertainty, showing remarkable results w.r.t. existing TTA methods inherited from image classification. Code is available at https://github.com/EgocentricVision/RNA-TTA.

M. Planamente and C. Plizzari contributed equally to this work.

Notes

1. \(\alpha, \beta, \gamma, \delta, \epsilon\) are the weights of the RNA\(^{++}\), MCC, ENT, IM and CENT losses, respectively.
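As a rough illustration of how five such weights might enter a single test-time objective, the sketch below forms a weighted sum of per-loss values. The weighted-sum form, the function names (`softmax`, `entropy`, `combine_losses`), the dictionary keys, and every numeric value are assumptions made for illustration; the note only states which weight belongs to which loss. The entropy helper is a generic uncertainty measure on softmax outputs and is not claimed to be the chapter's exact ENT definition.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one probability vector.

    Low entropy means a confident prediction, so minimizing it is one
    generic way to reduce classifier uncertainty.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def combine_losses(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of named loss values (assumed form of the objective).

    Every loss must have a weight; a weight of 0.0 switches a term off.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())
```

A hypothetical call would be `combine_losses({"rna": l_rna, "mcc": l_mcc, "ent": l_ent, "im": l_im, "cent": l_cent}, {"rna": alpha, "mcc": beta, "ent": gamma, "im": delta, "cent": epsilon})`, where each `l_*` is a loss value already computed on the current batch of target videos.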

Acknowledgements

This work was supported by the CINI Consortium through the VIDESEC project.

Author information

Authors and Affiliations

Politecnico di Torino, Turin, Italy
Mirco Planamente, Chiara Plizzari & Barbara Caputo
CINI Consortium, Venice, Italy
Mirco Planamente

Corresponding author

Correspondence to Chiara Plizzari.

Editor information

Editors and Affiliations

Boston University, Boston, MA, USA
Stan Sclaroff
National Research Council, Lecce, Italy
Cosimo Distante
National Research Council, Lecce, Italy
Marco Leo
University of Catania, Catania, Italy
Giovanni M. Farinella
Technische Universität München, Garching, Germany
Federico Tombari

Cite this paper

Planamente, M., Plizzari, C., Caputo, B. (2022). Test-Time Adaptation for Egocentric Action Recognition. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13233. Springer, Cham. https://doi.org/10.1007/978-3-031-06433-3_18

DOI: https://doi.org/10.1007/978-3-031-06433-3_18
Published: 15 May 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06432-6
Online ISBN: 978-3-031-06433-3
eBook Packages: Computer Science, Computer Science (R0)
