Abstract
We propose a novel architecture design for video prediction that utilizes procedural domain knowledge directly as part of the computational graph of data-driven models. On the basis of new, challenging scenarios, we show that state-of-the-art video predictors struggle in complex dynamical settings, and that introducing prior process knowledge makes their learning problem feasible. Our approach learns a symbolically addressable interface between the data-driven parts of the model and our dedicated procedural knowledge module, which we utilize in downstream control tasks.
References
Brockman, G., et al.: OpenAI gym. arXiv preprint arXiv:1606.01540 (2016)
Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1174–1183. PMLR (2018)
Donà, J., Franceschi, J.Y., Lamprier, S., Gallinari, P.: PDE-driven spatiotemporal disentanglement. In: International Conference on Learning Representations (2021)
Greff, K., et al.: Kubric: a scalable dataset generator. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3739–3751 (2022). https://doi.org/10.1109/CVPR52688.2022.00373
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015)
Jaques, M., Burke, M., Hospedales, T.: Physics-as-inverse-graphics: unsupervised physical parameter estimation from video. In: International Conference on Learning Representations (2019)
Jumper, J., et al.: Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
Kandukuri, R.K., Achterhold, J., Moeller, M., Stueckler, J.: Physical representation learning and parameter identification from video using differentiable physics. Int. J. Comput. Vision 130(1), 3–16 (2022). https://doi.org/10.1007/s11263-021-01493-5
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015)
Kipf, T., et al.: Conditional object-centric learning from video. In: International Conference on Learning Representations (2022)
Kosiorek, A., Kim, H., Teh, Y.W., Posner, I.: Sequential attend, infer, repeat: generative modelling of moving objects. In: Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
Le Guen, V., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11471–11481. IEEE, Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.01149
Lin, Z., Wu, Y.F., Peri, S., Fu, B., Jiang, J., Ahn, S.: Improving generative imagination in object-centric world models. In: Proceedings of the 37th International Conference on Machine Learning, pp. 6140–6149. PMLR (2020)
Locatello, F., et al.: Object-centric learning with slot attention. In: Advances in Neural Information Processing Systems, vol. 33, pp. 11525–11538. Curran Associates, Inc. (2020)
Marcus, G., Davis, E.: Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon Books, USA (2019)
Murthy, J.K., et al.: gradSim: differentiable simulation for system identification and visuomotor control. In: International Conference on Learning Representations (2020)
Musielak, Z.E., Quarles, B.: The three-body problem. Rep. Prog. Phys. 77(6), 065901 (2014). https://doi.org/10.1088/0034-4885/77/6/065901
Ouyang, L., et al.: Training language models to follow instructions with human feedback. Adv. Neural. Inf. Process. Syst. 35, 27730–27744 (2022)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045
Takenaka, P., Maucher, J., Huber, M.F.: Guiding video prediction with explicit procedural knowledge. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 1084–1092 (2023)
Traub, M., Otte, S., Menge, T., Karlbauer, M., Thuemmel, J., Butz, M.V.: Learning what and where: disentangling location and identity tracking without supervision. In: The Eleventh International Conference on Learning Representations (2023)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
von Rueden, L., et al.: Informed machine learning - a taxonomy and survey of integrating prior knowledge into learning systems. IEEE Trans. Knowl. Data Eng. 35(1), 614–633 (2023). https://doi.org/10.1109/TKDE.2021.3079836
Wang, Y., Long, M., Wang, J., Gao, Z., Yu, P.S.: PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 879–888. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)
Wang, Y., et al.: PredRNN: a recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 2208–2225 (2023). https://doi.org/10.1109/TPAMI.2022.3165153
Watters, N., Matthey, L., Burgess, C.P., Lerchner, A.: Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017 (2019)
Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., Tacchetti, A.: Visual interaction networks: learning a physics simulator from video. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
Wu, X., Lu, J., Yan, Z., Zhang, G.: Disentangling stochastic PDE dynamics for unsupervised video prediction. IEEE Trans. Neural Netw. Learn. Syst. 1–15 (2023). https://doi.org/10.1109/TNNLS.2023.3286890
Wu, Z., Dvornik, N., Greff, K., Kipf, T., Garg, A.: SlotFormer: unsupervised visual dynamics simulation with object-centric models. In: The Eleventh International Conference on Learning Representations (2023)
Wu, Z., Hu, J., Lu, W., Gilitschenski, I., Garg, A.: SlotDiffusion: object-centric generative modeling with diffusion models (2023). https://openreview.net/forum?id=ETk6cfS3vk
Xu, J., Zhang, Z., Friedman, T., Liang, Y., Van den Broeck, G.: A semantic loss function for deep learning with symbolic knowledge. In: Proceedings of the 35th International Conference on Machine Learning, pp. 5502–5511. PMLR (2018)
Yang, T.Y., Rosca, J.P., Narasimhan, K.R., Ramadge, P.: Learning physics constrained dynamics using autoencoders. In: Advances in Neural Information Processing Systems (2022)
Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. In: International Conference on Learning Representations (2019)
Appendices
A Further Implementation Details
In the following, we describe the core components of our architecture in more detail.
A.1 Video Frame Encoder
The video frame encoder is a standard CNN. The input video frames are encoded in parallel by merging the temporal dimension T with the batch dimension B. The CNN consists of four convolutional layers, each with 64 filters, a kernel size of 5, and a stride of 1. In the non object-centric variant of our architecture, the output features are flattened and transformed by a final fully connected network consisting of an initial layer normalization, a single hidden layer with ReLU activation, and a final linear output layer, with \(C=768\) neurons each. The result is a latent vector of size \(B\times T\times C\) that serves as input to P.
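For illustration, below is a minimal PyTorch sketch of this non object-centric encoder. The padding choice and module names are our own assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Sketch of the non object-centric video frame encoder (A.1)."""

    def __init__(self, latent_dim: int = 768, img_size: int = 64):
        super().__init__()
        # Four conv layers, 64 filters each, kernel size 5, stride 1
        # (padding=2 is an assumption to preserve the spatial resolution).
        layers, in_ch = [], 3
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, 64, kernel_size=5, padding=2), nn.ReLU()]
            in_ch = 64
        self.cnn = nn.Sequential(*layers)
        flat = 64 * img_size * img_size  # no downsampling with stride 1
        # LayerNorm -> hidden layer with ReLU -> linear output, C = 768 neurons each.
        self.mlp = nn.Sequential(
            nn.LayerNorm(flat),
            nn.Linear(flat, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> merge time into the batch dimension.
        B, T, C, H, W = frames.shape
        x = self.cnn(frames.reshape(B * T, C, H, W))
        z = self.mlp(x.flatten(start_dim=1))
        return z.reshape(B, T, -1)  # (B, T, C=768)
```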
In the object-centric variant, a position embedding is additionally applied after the CNN, and only the spatial dimensions H and W are flattened before the transformation of the fully connected network, with C reduced to 128. The result is a latent vector of size \(B\times T\times C \times H \times W\). In each burn-in iteration of the object-centric variant, we use the Slot Attention mechanism [15] to obtain updated object latent vectors before applying P.
A.2 Procedural Knowledge Module
P is responsible for predicting the latent vector of the next frame. It consists of the following submodules:
\(P_\textrm{in}\). Responsible for transforming the latent vector obtained from the image frame encoder into a separable latent vector z. It is implemented as a fully connected network with a single hidden layer, where every layer is followed by a ReLU activation. The number of neurons in all layers corresponds to C.
\(P_\textrm{out}\). Responsible for transforming z back into the latent image space. It has the same structure as \(P_\textrm{in}\).
\(F_\textrm{in}\). Responsible for transforming \(z_a\) within z into the symbolic space required by F. It is a single linear layer without bias neurons. In the object-centric case, its output size directly corresponds to the number of parameters \(N_\textrm{param}\) required by F for a single object. In the non object-centric case, where no separate object dimension is available, it instead corresponds to \(N_\textrm{param} \times N_\textrm{objects}\), where \(N_\textrm{objects}\) is the (fixed) number of objects in the dataset (if present).
F. Contains the integrated function directly as part of the computational graph. Details about F for the individual data scenarios can be found in Appendix E.
\(F_\textrm{out}\). Same structure as \(F_\textrm{in}\), with the input and output sizes reversed.
R. Responsible for modelling residual dynamics not handled by F. We implement it as a transformer [23] with two layers and four heads. We set the latent size to C and the dimension of its feed-forward network to 512. It takes into account the most recent six frame encodings, with a temporal position embedding applied before the transformer. Its output corresponds to \(\hat{z}_b\).
We first transform the latent image vector into a separable latent vector z with \(P_\textrm{in}\). We then split z of size C into three equally sized components \(z_a\), \(z_b\), and \(z_c\), and obtain their respective next-frame predictions \(\hat{z}_a\), \(\hat{z}_b\), and \(\hat{z}_c\) as follows: \(\hat{z}_a\) via F, \(\hat{z}_b\) by transforming z with R, and \(\hat{z}_c\) by copying \(z_c\) directly. All three components are merged back together and transformed into the image latent space with \(P_\textrm{out}\) before decoding.
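The following PyTorch sketch condenses this prediction step. F and R are passed in as callables; all module names, as well as the assumption that R's output already matches the size of \(z_b\), are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ProceduralKnowledgeModule(nn.Module):
    """Sketch of P (A.2): split z into (z_a, z_b, z_c), predict each part, merge."""

    def __init__(self, C: int, n_param: int, F, R):
        super().__init__()
        assert C % 3 == 0
        # P_in / P_out: single hidden layer, every layer followed by ReLU, C neurons each.
        self.P_in = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C), nn.ReLU())
        self.P_out = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C), nn.ReLU())
        # F_in / F_out: single linear layers without bias (object-centric sizes shown).
        self.F_in = nn.Linear(C // 3, n_param, bias=False)   # z_a -> symbolic state
        self.F_out = nn.Linear(n_param, C // 3, bias=False)  # symbolic state -> ẑ_a
        self.F, self.R = F, R  # procedural function and residual transformer

    def forward(self, latent: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        z = self.P_in(latent)
        z_a, z_b, z_c = z.chunk(3, dim=-1)   # z_b itself is replaced by R's prediction
        z_a_hat = self.F_out(self.F(self.F_in(z_a)))  # procedural prediction
        z_b_hat = self.R(history)   # residual dynamics; assumed to match z_b's size
        z_c_hat = z_c               # passed through unchanged
        z_hat = torch.cat([z_a_hat, z_b_hat, z_c_hat], dim=-1)
        return self.P_out(z_hat)    # back to the image latent space
```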
A.3 Video Frame Decoder
We implement the video frame decoder as a Spatial Broadcast Decoder [27]. We set the resolution for the spatial broadcast to 8 and first apply a positional embedding to the expanded latent vector. We then transform the output with four deconvolutional layers, each with 64 filters. We add a final convolutional layer with 3 output channels to obtain the decoded image. We use a stride of 2 in each layer until we reach the desired output resolution of 64, after which we use a stride of 1. In the object-centric variant, we set the number of output channels to 4 and use the first channel as weights w. We then reduce the object dimension after decoding as in [15]: we normalize the object dimension of w via a softmax and use it to compute a weighted sum over the object dimension of the RGB output channels.
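A minimal sketch of the non object-centric decoder is given below, assuming a PyTorch implementation; the deconvolution kernel size and the learned additive position embedding are our own assumptions.

```python
import torch
import torch.nn as nn

class BroadcastDecoder(nn.Module):
    """Sketch of the Spatial Broadcast Decoder (A.3), non object-centric variant."""

    def __init__(self, latent_dim: int = 768, broadcast_res: int = 8):
        super().__init__()
        self.broadcast_res = broadcast_res
        # Learned additive embedding stands in for the paper's positional encoding.
        self.pos_emb = nn.Parameter(torch.zeros(1, latent_dim, broadcast_res, broadcast_res))
        # Four deconvolutions, 64 filters each; stride 2 until the 64x64 output
        # resolution is reached (8 -> 16 -> 32 -> 64), then stride 1.
        layers, in_ch = [], latent_dim
        for s in (2, 2, 2, 1):
            layers += [nn.ConvTranspose2d(in_ch, 64, kernel_size=5, stride=s,
                                          padding=2, output_padding=s - 1), nn.ReLU()]
            in_ch = 64
        self.deconv = nn.Sequential(*layers)
        self.to_rgb = nn.Conv2d(in_ch, 3, kernel_size=3, padding=1)  # 3 output channels

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Broadcast the latent vector over an 8x8 grid and add the position embedding.
        x = z[:, :, None, None].expand(-1, -1, self.broadcast_res, self.broadcast_res)
        x = x + self.pos_emb
        return self.to_rgb(self.deconv(x))  # (B, 3, 64, 64)
```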
A.4 Training Details
We train all models for at most 500k iterations each, or until convergence is observed via early stopping. We clip gradients to a maximum norm of 0.05 and train using the Adam optimizer [10] with an initial learning rate of \(2\times 10^{-4}\). We set the loss weighting factor \(\lambda \) to 1. We set the batch size according to the available GPU memory, which was 32 in our case. We performed the experiments on four NVIDIA TITAN Xp GPUs with 12 GB of VRAM each, taking on average one to two days per run.
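As a compact illustration of these settings, a PyTorch training-loop sketch is shown below; `loader` and `compute_loss` are placeholders for the dataset iterator and the \(\lambda\)-weighted loss, not part of the paper.

```python
import torch
from torch import nn

def train(model: nn.Module, loader, compute_loss, max_iters: int = 500_000):
    """Training-loop sketch matching the settings above (Adam, lr 2e-4, grad clip 0.05)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    step = 0
    for batch in loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        # Clip gradients to a maximum norm of 0.05 before the optimizer step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.05)
        optimizer.step()
        step += 1
        if step >= max_iters:  # early stopping on convergence omitted in this sketch
            break
    return model
```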
B Details for Comparison Models
Takenaka et al. [21]. We apply the training process and configuration described in their paper, but use an RGB reconstruction loss to fit our training framework. We integrate the same procedural function here as in our model.
SlotDiffusion [31]. We use the three-stage training process as described in the paper, with all hyperparameters set as recommended.
SlotFormer [30]. We follow their proposed training regimen and architecture configuration for the CLEVRER [34] dataset, as its makeup is the most similar to our datasets.
PhyDNet. We use their recommended training and architecture configuration without changes.
PredRNN-V2. We use their recommended configuration for the Moving MNIST dataset.
Donà et al. [4]. We report the performance of their recommended configuration for the Sea Surface Temperature (SST) dataset, as it resulted in the best performance on our datasets.
C Further Dataset Details
In Table 3 we show further statistics of our introduced datasets.
D Orbits Control Validation Dataset Details
In the Orbits setting, the object positions are part of the symbolic state and are integral to correctly rendering the output frame. However, it is not trivial to measure how well our model can decode "hand-controlled" 3D object positions into a 2D frame in a generalizable manner. We therefore set up an empirical evaluation framework by assembling variations of the Orbits dataset, ranging from different simulation parameters to completely novel dynamics to non-physics settings such as trajectory following. For each validation set, we replace F of a model trained on the default Orbits dataset with the respective version that handles the new dynamics, and then validate the model without any retraining.
As Table 4 shows, the performance across all validation settings is comparable to the default dataset, demonstrating that the outputs of F serve as a reliable control interface at test time. We note that the much lower LPIPS for test setting E is due to the objects quickly leaving the scene, resulting in mostly background frames. Table 5 describes each setting in more detail.
E Integrated Function Details
This section lists the functions integrated in our model. All functions first calculate the appropriate acceleration a and then apply it in a semi-implicit Euler integration step with step size \(\varDelta t\).
For the Orbits dataset, each object's state consists of position p and velocity v. The environmental constants are the gravitational constant g and the object mass m. Given N objects in the scene at video frame t, the object state at the next time step \(t+1\) for any object n is obtained as follows:
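A minimal sketch of this update, assuming pairwise Newtonian gravity followed by the semi-implicit Euler step described above; the numerical clamping term `eps` is our own addition and not part of the paper's formulation.

```python
import torch

def orbits_step(p, v, m, g, dt, eps=1e-6):
    """One semi-implicit Euler step for the Orbits dynamics sketched above.

    p: (N, D) positions, v: (N, D) velocities, m: (N,) masses,
    g: gravitational constant, dt: step size.
    """
    # Pairwise Newtonian gravity; the zero self-difference contributes nothing.
    diff = p[None, :, :] - p[:, None, :]                   # (N, N, D) vectors to other objects
    dist = diff.norm(dim=-1, keepdim=True).clamp_min(eps)  # (N, N, 1) pairwise distances
    a = (g * m[None, :, None] * diff / dist.pow(3)).sum(dim=1)
    # Semi-implicit Euler: update the velocity first, then the position.
    v_next = v + a * dt
    p_next = p + v_next * dt
    return p_next, v_next
```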
For the Acrobot dataset, the per-frame state consists of the pendulum angles \(\theta _1\) and \(\theta _2\) and their angular velocities \(\dot{\theta }_1\) and \(\dot{\theta }_2\). The environmental constants consist of the pendulum masses \(m_1\) and \(m_2\), the pendulum lengths \(l_1\) and \(l_2\), the link centers of mass \(c_1\) and \(c_2\), the inertias \(I_1\) and \(I_2\), and the gravitational constant G. The pendulum state at the next time step \(t+1\) is calculated as follows:
The Pendulum Camera dataset follows the same equations as the Acrobot dataset to obtain an updated pendulum state. Afterwards, this state is used to obtain the new camera position \(p_{c_{t+1}}\):
F MPC Details
We set the control objective to maximize the potential energy (i.e., both pendulums oriented upwards) and minimize the kinetic energy (i.e., resting pendulums). The system model corresponds to our integrated function F and, being already discretized, requires no further processing. We use a controller with a prediction horizon of 150 steps and store the predicted torque action sequence for the next 75 frames.
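For illustration, the sketch below shows a simple random-shooting variant of such a receding-horizon controller built around F; the shooting strategy, sample count, torque limit, and the `potential`/`kinetic` helper functions are our own assumptions and not necessarily the controller used in the paper.

```python
import torch

def plan_torques(f_step, potential, kinetic, state,
                 horizon=150, apply_len=75, n_samples=256, max_torque=1.0):
    """Illustrative random-shooting MPC sketch for the Acrobot control task.

    f_step(state, torque) is the already-discretized system model F;
    `potential` and `kinetic` are assumed helpers computing the energies of a state.
    """
    # Sample candidate torque sequences uniformly within the actuation limits.
    candidates = (torch.rand(n_samples, horizon) * 2 - 1) * max_torque
    best_cost, best_seq = float("inf"), candidates[0]
    for seq in candidates:
        s, cost = state, 0.0
        for u in seq:
            s = f_step(s, u)
            # Maximize potential energy (upright), minimize kinetic energy (at rest).
            cost += kinetic(s) - potential(s)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    # Receding horizon: store and apply only the first `apply_len` torques.
    return best_seq[:apply_len]
```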
F.1 Qualitative Results
This section shows additional qualitative results for the MPC task.