ViPro: Enabling and Controlling Video Prediction for Complex Dynamical Scenarios Using Procedural Knowledge

Conference paper in Neural-Symbolic Learning and Reasoning (NeSy 2024)

Abstract

We propose a novel architecture design for video prediction in order to utilize procedural domain knowledge directly as part of the computational graph of data-driven models. On the basis of new, challenging scenarios, we show that state-of-the-art video predictors struggle in complex dynamical settings and highlight that the introduction of prior process knowledge makes their learning problem feasible. Our approach results in the learning of a symbolically addressable interface between the data-driven aspects of the model and our dedicated procedural knowledge module, which we utilize in downstream control tasks.

References

  1. Brockman, G., et al.: OpenAI gym. arXiv preprint arXiv:1606.01540 (2016)

  2. Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)

  3. Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1174–1183. PMLR (2018)

  4. Donà, J., Franceschi, J.Y., Lamprier, S., Gallinari, P.: PDE-driven spatiotemporal disentanglement. In: International Conference on Learning Representations (2021)

  5. Greff, K., et al.: Kubric: a scalable dataset generator. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3739–3751 (2022). https://doi.org/10.1109/CVPR52688.2022.00373

  6. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015)

  7. Jaques, M., Burke, M., Hospedales, T.: Physics-as-inverse-graphics: unsupervised physical parameter estimation from video. In: International Conference on Learning Representations (2019)

  8. Jumper, J., et al.: Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2

  9. Kandukuri, R.K., Achterhold, J., Moeller, M., Stueckler, J.: Physical representation learning and parameter identification from video using differentiable physics. Int. J. Comput. Vision 130(1), 3–16 (2022). https://doi.org/10.1007/s11263-021-01493-5

  10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015)

  11. Kipf, T., et al.: Conditional object-centric learning from video. In: International Conference on Learning Representations (2022)

  12. Kosiorek, A., Kim, H., Teh, Y.W., Posner, I.: Sequential attend, infer, repeat: generative modelling of moving objects. In: Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)

  13. Le Guen, V., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11471–11481. IEEE, Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.01149

  14. Lin, Z., Wu, Y.F., Peri, S., Fu, B., Jiang, J., Ahn, S.: Improving generative imagination in object-centric world models. In: Proceedings of the 37th International Conference on Machine Learning, pp. 6140–6149. PMLR (2020)

  15. Locatello, F., et al.: Object-centric learning with slot attention. In: Advances in Neural Information Processing Systems, vol. 33, pp. 11525–11538. Curran Associates, Inc. (2020)

  16. Marcus, G., Davis, E.: Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon Books, USA (2019)

  17. Murthy, J.K., et al.: gradSim: differentiable simulation for system identification and visuomotor control. In: International Conference on Learning Representations (2020)

  18. Musielak, Z.E., Quarles, B.: The three-body problem. Rep. Prog. Phys. 77(6), 065901 (2014). https://doi.org/10.1088/0034-4885/77/6/065901

  19. Ouyang, L., et al.: Training language models to follow instructions with human feedback. Adv. Neural. Inf. Process. Syst. 35, 27730–27744 (2022)

  20. Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045

  21. Takenaka, P., Maucher, J., Huber, M.F.: Guiding video prediction with explicit procedural knowledge. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 1084–1092 (2023)

  22. Traub, M., Otte, S., Menge, T., Karlbauer, M., Thuemmel, J., Butz, M.V.: Learning what and where: disentangling location and identity tracking without supervision. In: The Eleventh International Conference on Learning Representations (2023)

  23. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)

  24. von Rueden, L., et al.: Informed machine learning - a taxonomy and survey of integrating prior knowledge into learning systems. IEEE Trans. Knowl. Data Eng. 35(1), 614–633 (2023). https://doi.org/10.1109/TKDE.2021.3079836

  25. Wang, Y., Long, M., Wang, J., Gao, Z., Yu, P.S.: PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 879–888. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)

  26. Wang, Y., et al.: PredRNN: a recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 2208–2225 (2023). https://doi.org/10.1109/TPAMI.2022.3165153

  27. Watters, N., Matthey, L., Burgess, C.P., Lerchner, A.: Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017 (2019)

  28. Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., Tacchetti, A.: Visual interaction networks: learning a physics simulator from video. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)

  29. Wu, X., Lu, J., Yan, Z., Zhang, G.: Disentangling stochastic PDE dynamics for unsupervised video prediction. IEEE Trans. Neural Netw. Learn. Syst. 1–15 (2023). https://doi.org/10.1109/TNNLS.2023.3286890

  30. Wu, Z., Dvornik, N., Greff, K., Kipf, T., Garg, A.: SlotFormer: unsupervised visual dynamics simulation with object-centric models. In: The Eleventh International Conference on Learning Representations (2023)

  31. Wu, Z., Hu, J., Lu, W., Gilitschenski, I., Garg, A.: SlotDiffusion: object-centric generative modeling with diffusion models (2023). https://openreview.net/forum?id=ETk6cfS3vk

  32. Xu, J., Zhang, Z., Friedman, T., Liang, Y., Broeck, G.: A semantic loss function for deep learning with symbolic knowledge. In: Proceedings of the 35th International Conference on Machine Learning, pp. 5502–5511. PMLR (2018)

  33. Yang, T.Y., Rosca, J.P., Narasimhan, K.R., Ramadge, P.: Learning physics constrained dynamics using autoencoders. In: Advances in Neural Information Processing Systems (2022)

  34. Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. In: International Conference on Learning Representations (2019)

Corresponding author: Patrick Takenaka.

Appendices

A Further Implementation Details

In the following, we describe the core components of our architecture in more detail.

1.1 A.1 Video Frame Encoder

The video frame encoder is a standard CNN. The input video frames are encoded in parallel by merging the temporal dimension T with the batch dimension B. The CNN consists of four convolutional layers, each with 64 filters, a kernel size of 5, and a stride of 1. In the non object-centric variant of our architecture, the output features are flattened and transformed by a final fully connected network, consisting of an initial layer normalization, a single hidden layer with ReLU activation, and a final linear output layer, each with \(C=768\) neurons. The result is a latent vector of size \(B\times T\times C\) that serves as input to P.

In the object-centric variant, a position embedding is additionally applied after the CNN, and only the spatial dimensions H and W are flattened before the transformation by the fully connected network, with C reduced to 128. The result is a latent vector of size \(B\times T\times C \times H \times W\). In each burn-in iteration of the object-centric variant, we use the Slot Attention mechanism [15] to obtain updated object latent vectors before applying P.
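
The following is a minimal PyTorch sketch of the non object-centric encoder; the choice of framework, the convolution padding, and the activations inside the CNN are assumptions not specified above.

```python
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Non object-centric frame encoder: four conv layers followed by an MLP head (sketch)."""

    def __init__(self, in_channels: int = 3, latent_dim: int = 768, resolution: int = 64):
        super().__init__()
        convs, c = [], in_channels
        for _ in range(4):  # four convolutional layers, 64 filters, kernel 5, stride 1
            convs += [nn.Conv2d(c, 64, kernel_size=5, stride=1, padding=2), nn.ReLU()]
            c = 64
        self.cnn = nn.Sequential(*convs)
        flat = 64 * resolution * resolution
        self.head = nn.Sequential(  # layer normalization -> hidden layer (ReLU) -> linear output
            nn.LayerNorm(flat),
            nn.Linear(flat, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W); frames are encoded in parallel by merging T into B
        B, T, C, H, W = video.shape
        features = self.cnn(video.reshape(B * T, C, H, W))
        z = self.head(features.flatten(start_dim=1))
        return z.reshape(B, T, -1)  # (B, T, 768), the input to P
```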

1.2 A.2 Procedural Knowledge Module

P is responsible for predicting the latent vector of the next frame. It consists of the following submodules:

\(P_\textrm{in}\). Responsible for transforming the latent vector obtained from the image frame encoder into a separable latent vector z. It is implemented as a fully connected network with a single hidden layer, where every layer is followed by a ReLU activation. The number of neurons in each layer corresponds to C.

\(P_\textrm{out}\). Responsible for transforming z back into the latent image space. It has the same structure as \(P_\textrm{in}\).

\(F_\textrm{in}\). Responsible for transforming \(z_a\) within z into the symbolic space required by F. It is a single linear layer without bias neurons. In the object-centric case, its output size directly corresponds to the number of parameters \(N_\textrm{param}\) required by F for a single object. In the non object-centric case, where no separate object dimension is available, its output size instead corresponds to \(N_\textrm{param} \times N_\textrm{objects}\), where \(N_\textrm{objects}\) is the (fixed) number of objects (if present in the dataset).

F. Contains the integrated function directly as part of the computational graph. Details about F for the individual data scenarios can be found in Appendix E.

\(F_\textrm{out}\). Same structure as \(F_\textrm{in}\), with the input and output sizes reversed.

R. Responsible for modelling residual dynamics not handled by F. We implement it as a transformer [23] with two layers and four heads. We set the latent size to C and the dimension of its feed-forward network to 512. It takes into account the most recent 6 frame encodings, and its output corresponds to \(\hat{z}_b\). A temporal position embedding is applied before the transformer.

We first transform the latent image vector into a separable latent vector z with \(P_\textrm{in}\). We then split z of size C into three equally sized components \(z_a\), \(z_b\), and \(z_c\), and obtain their respective next-frame predictions \(\hat{z}_a\), \(\hat{z}_b\), and \(\hat{z}_c\) as follows: \(\hat{z}_a\) by applying F, \(\hat{z}_b\) by transforming z with R, while \(\hat{z}_c\) directly corresponds to \(z_c\). All three components are merged back together and transformed into the image latent space with \(P_\textrm{out}\) before decoding.
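
A simplified, non object-centric PyTorch sketch of this data flow is given below; the split sizes, the projection of R's output to the size of \(z_b\), the handling of the frame history, and the default argument values are assumptions where the description above does not pin them down, and the temporal position embedding is omitted.

```python
import torch
import torch.nn as nn


class ProceduralModule(nn.Module):
    """P: split latent z into (z_a, z_b, z_c); advance z_a with F and z_b with R (sketch)."""

    def __init__(self, F, C: int = 768, n_params: int = 4, n_objects: int = 3):
        super().__init__()  # n_params / n_objects are hypothetical defaults

        def mlp():  # single hidden layer, every layer followed by ReLU, width C
            return nn.Sequential(nn.Linear(C, C), nn.ReLU(),
                                 nn.Linear(C, C), nn.ReLU())

        self.P_in, self.P_out = mlp(), mlp()
        d = n_params * n_objects                      # non object-centric: all objects at once
        self.F_in = nn.Linear(C // 3, d, bias=False)  # latent -> symbolic state
        self.F_out = nn.Linear(d, C // 3, bias=False) # symbolic state -> latent
        self.F = F                                    # integrated procedural function
        layer = nn.TransformerEncoderLayer(d_model=C, nhead=4, dim_feedforward=512,
                                           batch_first=True)
        self.R = nn.TransformerEncoder(layer, num_layers=2)
        self.to_zb = nn.Linear(C, C // 3)             # project R's output to z_b's size (assumption)

    def forward(self, frame_latents: torch.Tensor) -> torch.Tensor:
        # frame_latents: (B, T<=6, C) most recent frame encodings; predict the next frame latent
        z = self.P_in(frame_latents[:, -1])           # separable latent of the current frame
        z_a, z_b, z_c = z.chunk(3, dim=-1)
        s = self.F_in(z_a)                            # symbolic state for F
        z_a_next = self.F_out(self.F(s))              # advance the symbolic state, map back
        h = self.R(self.P_in(frame_latents))          # residual dynamics over the history
        z_b_next = self.to_zb(h[:, -1])
        z_next = torch.cat([z_a_next, z_b_next, z_c], dim=-1)
        return self.P_out(z_next)                     # back to the image latent space
```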

1.3 A.3 Video Frame Decoder

We implement the video frame decoder as a Spatial Broadcast Decoder [27]. We set the resolution for the spatial broadcast to 8 and first apply a positional embedding to the expanded latent vector. We then transform the output with four deconvolutional layers, each with 64 filters, and add a final convolutional layer with 3 output channels to obtain the decoded image. We use a stride of 2 in each layer until we arrive at the desired output resolution of 64, after which we use a stride of 1. In the object-centric variant, we set the number of output channels of the final layer to 4 and use the first channel as weights w. We then reduce the object dimension after decoding as in [15] by normalizing the object dimension of w via softmax and using it to compute a weighted sum over the object dimension of the RGB output channels.
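
A minimal PyTorch sketch of the non object-centric decoder follows; the exact kind of positional embedding and the deconvolution padding are assumptions.

```python
import torch
import torch.nn as nn


class SpatialBroadcastDecoder(nn.Module):
    """Broadcast the latent to an 8x8 grid, then upsample to 64x64 with deconvolutions (sketch)."""

    def __init__(self, latent_dim: int = 768, broadcast_res: int = 8):
        super().__init__()
        self.res = broadcast_res
        # learned positional embedding added after broadcasting (the exact embedding is an assumption)
        self.pos = nn.Parameter(torch.zeros(1, latent_dim, broadcast_res, broadcast_res))
        layers, c = [], latent_dim
        for stride in (2, 2, 2, 1):  # 8 -> 16 -> 32 -> 64, then one stride-1 layer
            layers += [nn.ConvTranspose2d(c, 64, kernel_size=5, stride=stride,
                                          padding=2, output_padding=stride - 1), nn.ReLU()]
            c = 64
        layers += [nn.Conv2d(64, 3, kernel_size=5, padding=2)]  # final layer with 3 (RGB) channels
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, latent_dim) -> broadcast over the spatial grid, then decode
        x = z[:, :, None, None].expand(-1, -1, self.res, self.res) + self.pos
        return self.net(x)  # (B, 3, 64, 64)
```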

1.4 A.4 Training Details

We train all models for at most 500k iterations each, or until convergence is detected by early stopping. We clip gradients to a maximum norm of 0.05 and train using the Adam optimizer [10] with an initial learning rate of \(2\times 10^{-4}\). We set the loss weighting factor \(\lambda \) to 1. We set the batch size according to the available GPU memory, which resulted in a batch size of 32 in our case. We performed the experiments on four NVIDIA TITAN Xp GPUs with 12 GB of VRAM each, with a run taking one to two days on average.
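
The optimization setup can be summarized by the following sketch; the loss computation, the data loader, and the early-stopping criterion are placeholders, and a single pass over the loader stands in for the full schedule.

```python
import torch


def train(model, loader, compute_loss, max_steps=500_000):
    """Optimization setup used for all models (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    for step, batch in enumerate(loader):
        if step >= max_steps:  # at most 500k iterations (early stopping omitted)
            break
        loss = compute_loss(model, batch)  # RGB reconstruction, loss weighting lambda = 1
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.05)  # clip to norm 0.05
        optimizer.step()
```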

B Details for Comparison Models

Takenaka et al. [21]. We apply the training process and configuration described in their paper, but use an RGB reconstruction loss instead so that it fits into our training framework. We integrate the same procedural function here as in our model.

SlotDiffusion [31]. We use the three-stage training process as described in the paper, with all hyperparameters set as recommended.

SlotFormer [30]. We use their proposed training and architecture configuration for the CLEVRER [34] dataset, as its makeup is the most similar to that of our datasets.

PhyDNet [13]. We use their recommended training and architecture configuration without changes.

PredRNN-V2 [26]. We use their recommended configuration for the Moving MNIST dataset.

Donà et al. [4]. We report the performance of their recommended configuration for the Sea Surface Temperature (SST) dataset, as it resulted in the best performance on our datasets.

C Further Dataset Details

In Table 3 we show further statistics of our introduced datasets.

Table 3. Detailed statistics of our introduced datasets.

D Orbits Control Validation Dataset Details

In the Orbits setting, the object positions are part of the symbolic state and are an integral factor in correctly rendering the output frame. However, it is not trivial to measure how well our model is able to decode "hand-controlled" 3D object positions into a 2D frame in a generalizable manner. We therefore set up an empirical evaluation framework by assembling variations of the Orbits dataset, ranging from modified simulation parameters and completely novel dynamics to non-physics settings such as trajectory following. For each validation set, we replace F of a model trained on the default Orbits dataset with the respective version that handles the new dynamics, and then validate the model without any retraining.

Table 4. LPIPS\(\downarrow \) Performance on the default Orbits dataset and the validation settings A-H. A: Increased frame rate; B: Increased gravitational constant; C: Tripled force; D: Repulsion instead of attraction; E: No forces; F: No forces and zero velocities; G: Objects follow set trajectories; H: Objects appear at random locations in each frame.

As can be seen in Table 4, the performance across all validation settings is comparable to that on the default dataset, which shows that the outputs of F serve as a reliable control interface at test time. We note that the much lower LPIPS for setting E is due to the objects quickly leaving the scene, resulting in mostly background frames. Table 5 describes each setting in more detail.

Table 5. Detailed description of the Orbits dataset variants that are used to verify generative control over the integrated function parameters.

E Integrated Function Details

This section shows the functions integrated in our model. All functions first calculate the appropriate acceleration a before applying it in a semi-implicit Euler integration step with step size \(\varDelta t\).

For the Orbits dataset, each object's state consists of its position p and velocity v. The environmental constants are the gravitational constant g and the object mass m. Given N objects in the scene at video frame t, the state of any object n at the next time step \(t+1\) is obtained as follows:

$$\begin{aligned} F_{t,n} &= \sum _{\begin{array}{c} i=0\\ i\ne n \end{array}}^N \frac{(p_{t,i} - p_{t,n})}{|(p_{t,i} - p_{t,n})|}\frac{gm}{|(p_{t,i} - p_{t,n})|^2} \end{aligned}$$
(3)
$$\begin{aligned} a_{t,n} &= \frac{F_{t,n}}{m}\end{aligned}$$
(4)
$$\begin{aligned} v_{t+1,n} &= v_{t,n} + \varDelta t a_{t,n}\end{aligned}$$
(5)
$$\begin{aligned} p_{t+1,n} &= p_{t,n} + \varDelta t v_{t+1,n} \end{aligned}$$
(6)
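
A direct PyTorch transcription of Eqs. (3)–(6) reads as follows; the tensor shapes and the numerical guard against division by zero are implementation choices rather than part of the equations.

```python
import torch


def orbits_step(p, v, g, m, dt):
    """One semi-implicit Euler step of the Orbits dynamics, Eqs. (3)-(6).

    p, v: tensors of shape (N, 3) holding the object positions and velocities at frame t.
    """
    diff = p[None, :, :] - p[:, None, :]                     # diff[n, i] = p_{t,i} - p_{t,n}
    dist = diff.norm(dim=-1, keepdim=True).clamp_min(1e-9)   # guard; the i == n terms stay zero
    pairwise = (diff / dist) * (g * m / dist ** 2)           # Eq. (3), contribution of object i on n
    force = pairwise.sum(dim=1)                              # sum over all other objects
    a = force / m                                            # Eq. (4)
    v_next = v + dt * a                                      # Eq. (5)
    p_next = p + dt * v_next                                 # Eq. (6)
    return p_next, v_next
```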

For the Acrobot dataset, the per-frame state consists of the pendulum angles \(\theta _1\) and \(\theta _2\) and their angular velocities \(\dot{\theta }_1\) and \(\dot{\theta }_2\). The environmental constants consist of the pendulum masses \(m_1\) and \(m_2\), the pendulum lengths \(l_1\) and \(l_2\), the link centers of mass \(c_1\) and \(c_2\), the inertias \(I_1\) and \(I_2\), and the gravitational constant G. The pendulum state at the next time step \(t+1\) is calculated as follows:

$$\begin{aligned} \delta _{1_t} &= m_1 c_1^2 + m_2 (l_1^2 + c_2^2 + 2l_1 c_2 \cos {(\theta _{2_t})}) + I_1 + I_2\end{aligned}$$
(7)
$$\begin{aligned} \delta _{2_t} &= m_2 (c_2^2 + l_1 c_2 \cos {(\theta _{2_t})}) + I_2\end{aligned}$$
(8)
$$\begin{aligned} \phi _{2_t} &= m_2 c_2 G \cos {(\theta _{1_t} + \theta _{2_t} - \frac{\pi }{2})}\end{aligned}$$
(9)
$$\begin{aligned} \phi _{1_t} &= -m_2 l_1 c_2 \dot{\theta }_{2_t}^2 \sin {(\theta _{2_t})} - 2 m_2 l_1 c_2 \dot{\theta }_{2_t} \dot{\theta }_{1_t} \sin {(\theta _{2_t})}\end{aligned}$$
(10)
$$\begin{aligned} &\;\;\;\; + (m_1 c_1 + m_2 l_1) G \cos {(\theta _{1_t} - \frac{\pi }{2})} + \phi _{2_t}\end{aligned}$$
(11)
$$\begin{aligned} \ddot{\theta }_{2_t} &= \frac{\frac{\delta _{2_t}}{\delta _{1_t}} \phi _{1_t} - m_2 l_1 c_2 \dot{\theta }_{1_t}^2 \sin {(\theta _{2_t})} - \phi _{2_t}}{m_2 c_2^2 + I_2 - \frac{\delta _{2_t}^2}{\delta _{1_t}}}\end{aligned}$$
(12)
$$\begin{aligned} \ddot{\theta }_{1_t} &= -\frac{\delta _{2_t} \ddot{\theta }_{2_t} + \phi _{1_t}}{\delta _{1_t}}\end{aligned}$$
(13)
$$\begin{aligned} \dot{\theta }_{1_{t+1}} &= \dot{\theta }_{1_{t}} + \varDelta t \ddot{\theta }_{1_t}\end{aligned}$$
(14)
$$\begin{aligned} \dot{\theta }_{2_{t+1}} &= \dot{\theta }_{2_{t}} + \varDelta t \ddot{\theta }_{2_t}\end{aligned}$$
(15)
$$\begin{aligned} \theta _{1_{t+1}} &= \theta _{1_{t}} + \varDelta t \dot{\theta }_{1_{t+1}}\end{aligned}$$
(16)
$$\begin{aligned} \theta _{2_{t+1}} &= \theta _{2_{t}} + \varDelta t \dot{\theta }_{2_{t+1}} \end{aligned}$$
(17)
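
A direct transcription of Eqs. (7)–(17) into Python reads as follows (variable names are illustrative):

```python
import math


def acrobot_step(th1, th2, dth1, dth2, m1, m2, l1, c1, c2, I1, I2, G, dt):
    """One semi-implicit Euler step of the Acrobot dynamics, Eqs. (7)-(17)."""
    cos2, sin2 = math.cos(th2), math.sin(th2)
    d1 = m1 * c1**2 + m2 * (l1**2 + c2**2 + 2 * l1 * c2 * cos2) + I1 + I2  # Eq. (7)
    d2 = m2 * (c2**2 + l1 * c2 * cos2) + I2                                 # Eq. (8)
    phi2 = m2 * c2 * G * math.cos(th1 + th2 - math.pi / 2)                  # Eq. (9)
    phi1 = (-m2 * l1 * c2 * dth2**2 * sin2
            - 2 * m2 * l1 * c2 * dth2 * dth1 * sin2
            + (m1 * c1 + m2 * l1) * G * math.cos(th1 - math.pi / 2)
            + phi2)                                                         # Eqs. (10)-(11)
    ddth2 = ((d2 / d1 * phi1 - m2 * l1 * c2 * dth1**2 * sin2 - phi2)
             / (m2 * c2**2 + I2 - d2**2 / d1))                              # Eq. (12)
    ddth1 = -(d2 * ddth2 + phi1) / d1                                       # Eq. (13)
    dth1, dth2 = dth1 + dt * ddth1, dth2 + dt * ddth2                       # Eqs. (14)-(15)
    th1, th2 = th1 + dt * dth1, th2 + dt * dth2                             # Eqs. (16)-(17)
    return th1, th2, dth1, dth2
```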

The Pendulum Camera dataset follows the same equations as the Acrobot dataset to obtain an updated pendulum state. Afterwards, this state is used to obtain the new camera position \(p_{c_{t+1}}\):

$$\begin{aligned} p_{c_{t+1}} = \begin{bmatrix} p_x\\ p_y\\ p_z\\ \end{bmatrix} &= \begin{bmatrix} -2l_1 \sin {(\theta _{1_{t+1}})} - l_2 \sin {(\theta _{1_{t+1}} + \theta _{2_{t+1}})} \\ 2l_1 \cos {(\theta _{1_{t+1}})} + l_2 \cos {(\theta _{1_{t+1}} + \theta _{2_{t+1}})}\\ 10 \end{bmatrix} \end{aligned}$$
(18)
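
Correspondingly, Eq. (18) can be transcribed as follows, with \(l_1\) and \(l_2\) taken from the Acrobot constants:

```python
import math


def camera_position(th1, th2, l1, l2):
    """Camera position of the Pendulum Camera dataset, Eq. (18)."""
    return (-2 * l1 * math.sin(th1) - l2 * math.sin(th1 + th2),  # p_x
            2 * l1 * math.cos(th1) + l2 * math.cos(th1 + th2),   # p_y
            10.0)                                                 # p_z (fixed height)
```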

F MPC Details

We set the control objective to the maximization of the potential energy (i.e., both pendulums oriented upwards) and the minimization of the kinetic energy (i.e., resting pendulums). The system model corresponds to our integrated function F and, as it is already discretized, does not require further processing. We use a controller with a prediction horizon of 150 steps and store the predicted torque action sequence for the next 75 frames.
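
As an illustration only, the following sketches a simple random-shooting controller around F; the solver choice, the torque parametrization, the sample count, and the exact energy-based cost terms are illustrative assumptions rather than the configuration used for the reported results.

```python
import torch


def mpc_plan(state, F, horizon=150, n_samples=256, max_torque=1.0):
    """Random-shooting MPC sketch: sample torque sequences, roll them out with F,
    and pick the one whose states are closest to 'both pendulums up, at rest'."""
    torques = (torch.rand(n_samples, horizon) * 2 - 1) * max_torque
    costs = torch.zeros(n_samples)
    for k in range(n_samples):
        s = state.clone()                              # (th1, th2, dth1, dth2)
        for t in range(horizon):
            s = F(s, torques[k, t])                    # discretized system model (signature assumed)
            th1, th2, dth1, dth2 = s
            # illustrative cost: low tip height (negative potential) plus kinetic energy
            costs[k] += torch.cos(th1) + torch.cos(th1 + th2) + 0.1 * (dth1**2 + dth2**2)
    best = torques[costs.argmin()]
    return best[:75]  # store the torque action sequence for the next 75 frames
```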

1.1 F.1 Qualitative Results

This section shows additional qualitative results for the MPC task.

Fig. 7. Qualitative results for the MPC task.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Takenaka, P., Maucher, J., Huber, M.F. (2024). ViPro: Enabling and Controlling Video Prediction for Complex Dynamical Scenarios Using Procedural Knowledge. In: Besold, T.R., d’Avila Garcez, A., Jimenez-Ruiz, E., Confalonieri, R., Madhyastha, P., Wagner, B. (eds) Neural-Symbolic Learning and Reasoning. NeSy 2024. Lecture Notes in Computer Science(), vol 14979. Springer, Cham. https://doi.org/10.1007/978-3-031-71167-1_4

  • DOI: https://doi.org/10.1007/978-3-031-71167-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71166-4

  • Online ISBN: 978-3-031-71167-1
