Abstract
We propose a novel architecture design for video prediction that utilizes procedural domain knowledge directly as part of the computational graph of data-driven models. On the basis of new, challenging scenarios, we show that state-of-the-art video predictors struggle in complex dynamical settings, and that introducing prior process knowledge makes their learning problem feasible. Our approach learns a symbolically addressable interface between the data-driven parts of the model and our dedicated procedural knowledge module, which we utilize in downstream control tasks.
References
Brockman, G., et al.: OpenAI gym. arXiv preprint arXiv:1606.01540 (2016)
Brown, T.B., et al.: Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020)
Denton, E., Fergus, R.: Stochastic video generation with a learned prior. In: Proceedings of the 35th International Conference on Machine Learning, pp. 1174–1183. PMLR (2018)
Donà, J., Franceschi, J.Y., Lamprier, S., Gallinari, P.: PDE-driven spatiotemporal disentanglement. In: International Conference on Learning Representations (2021)
Greff, K., et al.: Kubric: a scalable dataset generator. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3739–3751 (2022). https://doi.org/10.1109/CVPR52688.2022.00373
Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015)
Jaques, M., Burke, M., Hospedales, T.: Physics-as-inverse-graphics: unsupervised physical parameter estimation from video. In: International Conference on Learning Representations (2019)
Jumper, J., et al.: Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2
Kandukuri, R.K., Achterhold, J., Moeller, M., Stueckler, J.: Physical representation learning and parameter identification from video using differentiable physics. Int. J. Comput. Vision 130(1), 3–16 (2022). https://doi.org/10.1007/s11263-021-01493-5
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings (2015)
Kipf, T., et al.: Conditional object-centric learning from video. In: International Conference on Learning Representations (2022)
Kosiorek, A., Kim, H., Teh, Y.W., Posner, I.: Sequential attend, infer, repeat: generative modelling of moving objects. In: Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
Le Guen, V., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11471–11481. IEEE, Seattle, WA, USA (2020). https://doi.org/10.1109/CVPR42600.2020.01149
Lin, Z., Wu, Y.F., Peri, S., Fu, B., Jiang, J., Ahn, S.: Improving generative imagination in object-centric world models. In: Proceedings of the 37th International Conference on Machine Learning, pp. 6140–6149. PMLR (2020)
Locatello, F., et al.: Object-centric learning with slot attention. In: Advances in Neural Information Processing Systems, vol. 33, pp. 11525–11538. Curran Associates, Inc. (2020)
Marcus, G., Davis, E.: Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon Books, USA (2019)
Murthy, J.K., et al.: gradSim: differentiable simulation for system identification and visuomotor control. In: International Conference on Learning Representations (2020)
Musielak, Z.E., Quarles, B.: The three-body problem. Rep. Prog. Phys. 77(6), 065901 (2014). https://doi.org/10.1088/0034-4885/77/6/065901
Ouyang, L., et al.: Training language models to follow instructions with human feedback. Adv. Neural. Inf. Process. Syst. 35, 27730–27744 (2022)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045
Takenaka, P., Maucher, J., Huber, M.F.: Guiding video prediction with explicit procedural knowledge. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 1084–1092 (2023)
Traub, M., Otte, S., Menge, T., Karlbauer, M., Thuemmel, J., Butz, M.V.: Learning what and where: disentangling location and identity tracking without supervision. In: The Eleventh International Conference on Learning Representations (2023)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
von Rueden, L., et al.: Informed machine learning - a taxonomy and survey of integrating prior knowledge into learning systems. IEEE Trans. Knowl. Data Eng. 35(1), 614–633 (2023). https://doi.org/10.1109/TKDE.2021.3079836
Wang, Y., Long, M., Wang, J., Gao, Z., Yu, P.S.: PredRNN: recurrent neural networks for predictive learning using spatiotemporal LSTMs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 879–888. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)
Wang, Y., et al.: PredRNN: a recurrent neural network for spatiotemporal predictive learning. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 2208–2225 (2023). https://doi.org/10.1109/TPAMI.2022.3165153
Watters, N., Matthey, L., Burgess, C.P., Lerchner, A.: Spatial broadcast decoder: a simple architecture for learning disentangled representations in VAEs. arXiv preprint arXiv:1901.07017 (2019)
Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., Tacchetti, A.: Visual interaction networks: learning a physics simulator from video. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
Wu, X., Lu, J., Yan, Z., Zhang, G.: Disentangling stochastic PDE dynamics for unsupervised video prediction. IEEE Trans. Neural Netw. Learn. Syst. 1–15 (2023). https://doi.org/10.1109/TNNLS.2023.3286890
Wu, Z., Dvornik, N., Greff, K., Kipf, T., Garg, A.: SlotFormer: unsupervised visual dynamics simulation with object-centric models. In: The Eleventh International Conference on Learning Representations (2023)
Wu, Z., Hu, J., Lu, W., Gilitschenski, I., Garg, A.: SlotDiffusion: object-centric generative modeling with diffusion models (2023). https://openreview.net/forum?id=ETk6cfS3vk
Xu, J., Zhang, Z., Friedman, T., Liang, Y., Van den Broeck, G.: A semantic loss function for deep learning with symbolic knowledge. In: Proceedings of the 35th International Conference on Machine Learning, pp. 5502–5511. PMLR (2018)
Yang, T.Y., Rosca, J.P., Narasimhan, K.R., Ramadge, P.: Learning physics constrained dynamics using autoencoders. In: Advances in Neural Information Processing Systems (2022)
Yi, K., et al.: CLEVRER: collision events for video representation and reasoning. In: International Conference on Learning Representations (2019)
Appendices
A Further Implementation Details
In the following, we describe the core components of our architecture in more detail.
A.1 Video Frame Encoder
The video frame encoder is a standard CNN. The input video frames are encoded in parallel by merging the temporal dimension T with the batch dimension B. The CNN consists of four convolutional layers, each with 64 filters, a kernel size of 5, and a stride of 1. In the non object-centric variant of our architecture, the output features are flattened and transformed by a final fully connected network consisting of an initial layer normalization, a single hidden layer with ReLU activation, and a final linear output layer, with \(C=768\) neurons each. The result is a latent vector of size \(B\times T\times C\) that serves as input to P.
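For illustration, below is a minimal PyTorch sketch of this non object-centric encoder. The padding choice and module names are our own assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Sketch of the non object-centric video frame encoder (A.1)."""

    def __init__(self, latent_dim: int = 768, img_size: int = 64):
        super().__init__()
        # Four conv layers, 64 filters each, kernel size 5, stride 1
        # (padding=2 is an assumption to preserve the spatial resolution).
        layers, in_ch = [], 3
        for _ in range(4):
            layers += [nn.Conv2d(in_ch, 64, kernel_size=5, padding=2), nn.ReLU()]
            in_ch = 64
        self.cnn = nn.Sequential(*layers)
        flat = 64 * img_size * img_size  # no downsampling with stride 1
        # LayerNorm -> hidden layer with ReLU -> linear output, C = 768 neurons each.
        self.mlp = nn.Sequential(
            nn.LayerNorm(flat),
            nn.Linear(flat, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> merge time into the batch dimension.
        B, T, C, H, W = frames.shape
        x = self.cnn(frames.reshape(B * T, C, H, W))
        z = self.mlp(x.flatten(start_dim=1))
        return z.reshape(B, T, -1)  # (B, T, C=768)
```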
In the object-centric variant, a position embedding is additionally applied after the CNN, and only the spatial dimensions H and W are flattened before the transformation of the fully connected network, with C reduced to 128. The result is a latent vector of size \(B\times T\times C \times H \times W\). In each burn-in iteration of the object-centric variant, we use the Slot Attention mechanism [15] to obtain updated object latent vectors before applying P.
A.2 Procedural Knowledge Module
P is responsible for predicting the latent vector of the next frame. It consists of the following submodules:
\(P_\textrm{in}\). Responsible for transforming the latent vector obtained from the image frame encoder into a separable latent vector z. It is implemented as a fully connected network with a single hidden layer, where every layer is followed by a ReLU activation. The number of neurons in all layers corresponds to C.
\(P_\textrm{out}\). Responsible for transforming z back into the latent image space. It has the same structure as \(P_\textrm{in}\).
\(F_\textrm{in}\). Responsible for transforming \(z_a\) within z into the symbolic space required by F. It is a single linear layer without bias neurons. In the object-centric case, its output size directly corresponds to the number of parameters \(N_\textrm{param}\) required by F for a single object. In the non object-centric case, where no separate object dimension is available, it instead corresponds to \(N_\textrm{param} \times N_\textrm{objects}\), where \(N_\textrm{objects}\) is the (fixed) number of objects in the dataset (if present).
F. Contains the integrated function directly as part of the computational graph. Details about F for the individual data scenarios can be found in Appendix E.
\(F_\textrm{out}\). Same structure as \(F_\textrm{in}\), with the input and output sizes reversed.
R. Responsible for modelling residual dynamics not handled by F. We implement it as a transformer [23] with two layers and four heads. We set the latent size to C and the dimension of its feed-forward network to 512. It takes into account the most recent six frame encodings, with a temporal position embedding applied before the transformer. Its output corresponds to \(\hat{z}_b\).
We first transform the latent image vector into a separable latent vector z with \(P_\textrm{in}\). We then split z of size C into three equally sized components \(z_a\), \(z_b\), and \(z_c\), and obtain their respective next-frame predictions \(\hat{z}_a\), \(\hat{z}_b\), and \(\hat{z}_c\) as follows: \(\hat{z}_a\) via F, \(\hat{z}_b\) by transforming z with R, and \(\hat{z}_c\) by copying \(z_c\) directly. All three components are merged back together and transformed into the image latent space with \(P_\textrm{out}\) before decoding.
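The following PyTorch sketch condenses this prediction step. F and R are passed in as callables; all module names, as well as the assumption that R's output already matches the size of \(z_b\), are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ProceduralKnowledgeModule(nn.Module):
    """Sketch of P (A.2): split z into (z_a, z_b, z_c), predict each part, merge."""

    def __init__(self, C: int, n_param: int, F, R):
        super().__init__()
        assert C % 3 == 0
        # P_in / P_out: single hidden layer, every layer followed by ReLU, C neurons each.
        self.P_in = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C), nn.ReLU())
        self.P_out = nn.Sequential(nn.Linear(C, C), nn.ReLU(), nn.Linear(C, C), nn.ReLU())
        # F_in / F_out: single linear layers without bias (object-centric sizes shown).
        self.F_in = nn.Linear(C // 3, n_param, bias=False)   # z_a -> symbolic state
        self.F_out = nn.Linear(n_param, C // 3, bias=False)  # symbolic state -> ẑ_a
        self.F, self.R = F, R  # procedural function and residual transformer

    def forward(self, latent: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        z = self.P_in(latent)
        z_a, z_b, z_c = z.chunk(3, dim=-1)   # z_b itself is replaced by R's prediction
        z_a_hat = self.F_out(self.F(self.F_in(z_a)))  # procedural prediction
        z_b_hat = self.R(history)   # residual dynamics; assumed to match z_b's size
        z_c_hat = z_c               # passed through unchanged
        z_hat = torch.cat([z_a_hat, z_b_hat, z_c_hat], dim=-1)
        return self.P_out(z_hat)    # back to the image latent space
```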
A.3 Video Frame Decoder
We implement the video frame decoder as a Spatial Broadcast Decoder [27]. We set the resolution for the spatial broadcast to 8 and first apply a positional embedding to the expanded latent vector. We then transform the output with four deconvolutional layers, each with 64 filters. We add a final convolutional layer with 3 output channels to obtain the decoded image. We use a stride of 2 in each layer until we reach the desired output resolution of 64, after which we use a stride of 1. In the object-centric variant, we set the number of output channels to 4 and use the first channel as weights w. We then reduce the object dimension after decoding as in [15]: we normalize the object dimension of w via a softmax and use it to compute a weighted sum over the object dimension of the RGB output channels.
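A minimal sketch of the non object-centric decoder is given below, assuming a PyTorch implementation; the deconvolution kernel size and the learned additive position embedding are our own assumptions.

```python
import torch
import torch.nn as nn

class BroadcastDecoder(nn.Module):
    """Sketch of the Spatial Broadcast Decoder (A.3), non object-centric variant."""

    def __init__(self, latent_dim: int = 768, broadcast_res: int = 8):
        super().__init__()
        self.broadcast_res = broadcast_res
        # Learned additive embedding stands in for the paper's positional encoding.
        self.pos_emb = nn.Parameter(torch.zeros(1, latent_dim, broadcast_res, broadcast_res))
        # Four deconvolutions, 64 filters each; stride 2 until the 64x64 output
        # resolution is reached (8 -> 16 -> 32 -> 64), then stride 1.
        layers, in_ch = [], latent_dim
        for s in (2, 2, 2, 1):
            layers += [nn.ConvTranspose2d(in_ch, 64, kernel_size=5, stride=s,
                                          padding=2, output_padding=s - 1), nn.ReLU()]
            in_ch = 64
        self.deconv = nn.Sequential(*layers)
        self.to_rgb = nn.Conv2d(in_ch, 3, kernel_size=3, padding=1)  # 3 output channels

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Broadcast the latent vector over an 8x8 grid and add the position embedding.
        x = z[:, :, None, None].expand(-1, -1, self.broadcast_res, self.broadcast_res)
        x = x + self.pos_emb
        return self.to_rgb(self.deconv(x))  # (B, 3, 64, 64)
```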
A.4 Training Details
We train all models for at most 500k iterations each, or until convergence is observed via early stopping. We clip gradients to a maximum norm of 0.05 and train using the Adam optimizer [10] with an initial learning rate of \(2\times 10^{-4}\). We set the loss weighting factor \(\lambda \) to 1. We set the batch size according to the available GPU memory, which was 32 in our case. We performed the experiments on four NVIDIA TITAN Xp GPUs with 12 GB of VRAM each, taking on average one to two days per run.
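As a compact illustration of these settings, a PyTorch training-loop sketch is shown below; `loader` and `compute_loss` are placeholders for the dataset iterator and the \(\lambda\)-weighted loss, not part of the paper.

```python
import torch
from torch import nn

def train(model: nn.Module, loader, compute_loss, max_iters: int = 500_000):
    """Training-loop sketch matching the settings above (Adam, lr 2e-4, grad clip 0.05)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    step = 0
    for batch in loader:
        optimizer.zero_grad()
        loss = compute_loss(model, batch)
        loss.backward()
        # Clip gradients to a maximum norm of 0.05 before the optimizer step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.05)
        optimizer.step()
        step += 1
        if step >= max_iters:  # early stopping on convergence omitted in this sketch
            break
    return model
```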
B Details for Comparison Models
Takenaka et al. [21]. We apply the training process and configuration described in their paper, but use an RGB reconstruction loss to fit our training framework. We integrate the same procedural function here as in our model.
SlotDiffusion [31]. We use the three-stage training process as described in the paper, with all hyperparameters set as recommended.
SlotFormer [30]. We follow their proposed training regimen and architecture configuration for the CLEVRER [34] dataset, as its makeup is the most similar to our datasets.
PhyDNet. We use their recommended training and architecture configuration without changes.
PredRNN-V2. We use their recommended configuration for the Moving MNIST dataset.
Donà et al. [4]. We report the performance of their recommended configuration for the Sea Surface Temperature (SST) dataset, as it resulted in the best performance on our datasets.
C Further Dataset Details
In Table 3 we show further statistics of our introduced datasets.
D Orbits Control Validation Dataset Details
In the Orbits setting, the object positions are part of the symbolic state and are integral to correctly rendering the output frame. However, it is not trivial to measure how well our model can decode "hand-controlled" 3D object positions into a 2D frame in a generalizable manner. We therefore set up an empirical evaluation framework by assembling variations of the Orbits dataset, ranging from different simulation parameters to completely novel dynamics to non-physics settings such as trajectory following. For each validation set, we replace F of a model trained on the default Orbits dataset with the respective version that handles the new dynamics, and then validate the model without any retraining.
As Table 4 shows, the performance across all validation settings is comparable to the default dataset, demonstrating that the outputs of F serve as a reliable control interface at test time. We note that the much lower LPIPS for test setting E is due to the objects quickly leaving the scene, resulting in mostly background frames. Table 5 describes each setting in more detail.
E Integrated Function Details
This section lists the functions integrated in our model. All functions first calculate the appropriate acceleration a and then apply it in a semi-implicit Euler integration step with step size \(\varDelta t\).
For the Orbits dataset, each object's state consists of position p and velocity v. The environmental constants are the gravitational constant g and the object mass m. Given N objects in the scene at video frame t, the object state at the next time step \(t+1\) for any object n is obtained as follows:
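A minimal sketch of this update, assuming pairwise Newtonian gravity followed by the semi-implicit Euler step described above; the numerical clamping term `eps` is our own addition and not part of the paper's formulation.

```python
import torch

def orbits_step(p, v, m, g, dt, eps=1e-6):
    """One semi-implicit Euler step for the Orbits dynamics sketched above.

    p: (N, D) positions, v: (N, D) velocities, m: (N,) masses,
    g: gravitational constant, dt: step size.
    """
    # Pairwise Newtonian gravity; the zero self-difference contributes nothing.
    diff = p[None, :, :] - p[:, None, :]                   # (N, N, D) vectors to other objects
    dist = diff.norm(dim=-1, keepdim=True).clamp_min(eps)  # (N, N, 1) pairwise distances
    a = (g * m[None, :, None] * diff / dist.pow(3)).sum(dim=1)
    # Semi-implicit Euler: update the velocity first, then the position.
    v_next = v + a * dt
    p_next = p + v_next * dt
    return p_next, v_next
```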
For the Acrobot dataset, the per-frame state consists of the pendulum angles \(\theta _1\) and \(\theta _2\) and their angular velocities \(\dot{\theta }_1\) and \(\dot{\theta }_2\). The environmental constants consist of the pendulum masses \(m_1\) and \(m_2\), the pendulum lengths \(l_1\) and \(l_2\), the link centers of mass \(c_1\) and \(c_2\), the inertias \(I_1\) and \(I_2\), and the gravitational constant G. The pendulum state at the next time step \(t+1\) is calculated as follows:
The Pendulum Camera dataset follows the same equations as the Acrobot dataset to obtain an updated pendulum state. Afterwards, this state is used to obtain the new camera position \(p_{c_{t+1}}\):
F MPC Details
We set the control objective to maximize the potential energy (i.e., both pendulums oriented upwards) and minimize the kinetic energy (i.e., resting pendulums). The system model corresponds to our integrated function F and, being already discretized, requires no further processing. We use a controller with a prediction horizon of 150 steps and store the predicted torque action sequence for the next 75 frames.
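For illustration, the sketch below shows a simple random-shooting variant of such a receding-horizon controller built around F; the shooting strategy, sample count, torque limit, and the `potential`/`kinetic` helper functions are our own assumptions and not necessarily the controller used in the paper.

```python
import torch

def plan_torques(f_step, potential, kinetic, state,
                 horizon=150, apply_len=75, n_samples=256, max_torque=1.0):
    """Illustrative random-shooting MPC sketch for the Acrobot control task.

    f_step(state, torque) is the already-discretized system model F;
    `potential` and `kinetic` are assumed helpers computing the energies of a state.
    """
    # Sample candidate torque sequences uniformly within the actuation limits.
    candidates = (torch.rand(n_samples, horizon) * 2 - 1) * max_torque
    best_cost, best_seq = float("inf"), candidates[0]
    for seq in candidates:
        s, cost = state, 0.0
        for u in seq:
            s = f_step(s, u)
            # Maximize potential energy (upright), minimize kinetic energy (at rest).
            cost += kinetic(s) - potential(s)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    # Receding horizon: store and apply only the first `apply_len` torques.
    return best_seq[:apply_len]
```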
F.1 Qualitative Results
This section shows additional qualitative results for the MPC task.