
Made to Order: Discovering Monotonic Temporal Changes via Self-supervised Video Ordering

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with ‘time’ serving as a supervisory signal, since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a transformer-based model for ordering image sequences of arbitrary length, with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple domains covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves state-of-the-art results on standard benchmarks for image ordering.
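To make the ordering proxy task concrete, below is a minimal PyTorch sketch of the idea summarised in the abstract: frames of a sequence are shuffled, encoded independently, passed through a transformer that receives no positional encoding (so the presentation order carries no information), and trained with cross-entropy to predict each frame's true temporal position. This is an illustrative toy, not the authors' released implementation; the class name, layer sizes, and frame dimensions are assumptions made for the example.

```python
# Toy sketch of a self-supervised ordering proxy task (illustrative only).
import torch
import torch.nn as nn


class OrderingModel(nn.Module):
    def __init__(self, seq_len=8, dim=128):
        super().__init__()
        self.seq_len = seq_len
        # Tiny CNN frame encoder (stand-in for a real visual backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
        # Transformer over the set of frame tokens; no positional encoding is
        # added, so the shuffled input order is uninformative by construction.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, seq_len)  # logits over temporal positions

    def forward(self, frames):               # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        tokens = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        tokens = self.transformer(tokens)
        return self.head(tokens)              # (B, T, T) position logits


def ordering_loss(model, ordered_frames):
    """Shuffle the frames, then train the model to recover the true order."""
    B, T = ordered_frames.shape[:2]
    perm = torch.stack([torch.randperm(T) for _ in range(B)])      # (B, T)
    shuffled = ordered_frames[torch.arange(B)[:, None], perm]      # apply perm
    logits = model(shuffled)                                       # (B, T, T)
    # The frame placed in shuffled slot i came from position perm[b, i].
    return nn.functional.cross_entropy(logits.flatten(0, 1), perm.flatten())


if __name__ == "__main__":
    model = OrderingModel(seq_len=8)
    frames = torch.randn(2, 8, 3, 64, 64)    # two toy sequences of 8 frames
    loss = ordering_loss(model, frames)
    loss.backward()
    print(float(loss))
```

Only changes that are monotonic with time help the model solve this task, which is why, after training, its attention-based attributions tend to highlight monotonically changing regions while ignoring cyclic or stochastic ones.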

Acknowledgements

We thank Tengda Han, Ragav Sachdeva, and Aleksandar Shtedritski for suggestions and proofreading. This research is supported by the UK EPSRC CDT in AIMS (EP/S024050/1), and the UK EPSRC Programme Grant Visual AI (EP/T028572/1).

Author information

Corresponding author

Correspondence to Charig Yang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13502 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, C., Xie, W., Zisserman, A. (2025). Made to Order: Discovering Monotonic Temporal Changes via Self-supervised Video Ordering. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15132. Springer, Cham. https://doi.org/10.1007/978-3-031-72904-1_16

  • DOI: https://doi.org/10.1007/978-3-031-72904-1_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72903-4

  • Online ISBN: 978-3-031-72904-1

  • eBook Packages: Computer Science, Computer Science (R0)
