Fig. 1.

Interpretable physics models. Consider the sequences shown above. Not only can we predict the future frames of the collisions, but we can also infer the underlying factors that lead to such a prediction. For example, we can infer that the mass of the cylinder is much higher in the second sequence, which is why it hardly moves in the image. Our ability to infer meaningful underlying latent factors inspires us to learn an interpretable intuitive physics model in this paper.

1 Introduction

Consider the collision image sequences shown in Fig. 1. When people see these images, they not only recognize the shapes and colors of the objects but also predict what is going to happen. For example, in the first sequence people can predict that the cylinder is going to rotate, while in the second sequence the ball will bounce back with no motion of the cylinder. But beyond visual prediction, we can even infer the underlying latent factors that explain the difference between the two predictions. For example, if we knew the ball's mass did not change, a possible explanation is that the cylinder in the first sequence was lighter than the ball, whereas in the second sequence it was heavier. Beyond this, we can deduce that the cylinder in the first sequence was much lighter than the one in the second.

Humans demonstrate a profound ability to understand the underlying physics of the world [9, 10] and use it to predict the future. We use this physical commonsense not only for rich understanding but also for physical interactions. The question is whether this physical commonsense corresponds to an end-to-end model whose intermediate representations are a black box, or to explicit and meaningful intermediate representations. For humans, the answer appears to be the latter: we can predict the future even when some underlying conditions are changed. For example, we can predict that if the ball in the second sequence were thrown with 10x the initial speed, the cylinder might rotate.

In this paper, we focus on learning an intuitive model of physics [2, 13, 17]. Unlike some recent efforts, where the goal is to learn physics in an end-to-end manner with little-to-no constraints on intermediary layers, we focus on learning an interpretable model. More specifically, the bottleneck layers in our network model physical properties such as mass, friction, etc.

Learning an interpretable intuitive physics model is, however, quite a challenging task. For example, in the model of Wu et al. [25], an inverse graphics engine infers physical properties such as mass and friction, which are then passed to a neural physics engine or simulator for prediction. But can we really infer exact physical properties from a few frames of such collisions? Can we separate friction from mass or restitution just by observing the frames? In fact, most of these physical factors are so interdependent that inferring their exact values is infeasible: we can determine ratios between properties but not their precise values (e.g., the relative mass of two objects but not the exact mass of either). This is precisely why in [25] only one factor is inferred from motion while the other is directly tied to appearance. Furthermore, the learned physics model is domain-specific and does not generalize, even across different shapes.

To tackle these challenges, we propose an interpretable intuitive physics model, where specific dimensions in the bottleneck layers correspond to different physical properties. The bottleneck layer models the distribution rather than inferring precise values of mass, speed and friction. In order to demonstrate that our system models these underlying physical properties, we train our model on collisions of different shapes (cube, cone, cylinder, sphere, etc.) and test on collisions of entirely unseen combinations of shapes. We also demonstrate the richness of our model by predicting the future states under different physical conditions (e.g., how the future frames will look if the friction is doubled).

Our contributions include: (a) an intuitive physics model that disentangles different physical properties in an interpretable way; (b) a staggered training algorithm designed to distinguish the subtleties between different physical quantities; (c) generalization to different shapes and physical quantity combinations; most importantly, (d) the ability to adapt future predictions when physical environments change. Note (d) is different from generalization: the hallucination/prediction is done for a physical scene completely different from the one observed in the first four frames.

2 Related Work

Physical reasoning and learning physical commonsense have attracted a lot of interest in recent years [1, 5, 16,17,18, 28, 29, 31]. There have been multiple efforts to learn implicit and explicit models of physical commonsense. The underlying goal of most of these systems is to use physics to predict what is going to happen next [6,7,8, 13, 14, 24, 26]. The hope is that if a model can predict what happens after objects interact, it will be forced to understand the physical properties of the objects. For example, [13] learns physical properties by predicting whether a tower of blocks will fall, and [7] learns a visual predictive model for playing billiards.

The first issue, however, is what the right data to learn such a physics model is. Researchers have tried a wide spectrum of approaches. Many have focused on visual prediction from real-world videos, based on the hypothesis that the predictive model will capture some underlying physical properties [15, 21, 22]. While videos provide realistic data, there is little to no control over how the data is collected, and the implicit models end up learning dynamic models of texture. To force physical commonsense learning, people have also used videos of physical interactions; for example, the Physics101 dataset [24] collects sequences of collisions for this task. But most of the learning still happens passively (random batches). To overcome this, recent approaches learn physics by active interaction using robots [1, 6, 18]. While this gives more control over data collection, the data still lacks diversity because most experiments are performed in lab settings with few objects. Finally, one can collect data with full control over several physical parameters using simulation, and there have been many recent efforts in this direction [7, 13, 16, 17]. One limitation of these approaches, in terms of data, is the lack of diversity during training, which forces them to learn physics models specific to particular shapes such as blocks, spheres, etc. Furthermore, none of these approaches uses the full power of simulation to generate a dense set of videos under multiple conditions. Most importantly, none of them learns an interpretable model.

Apart from the question of data, another core issue is how explicit the representation of physics is in these models. Truly understanding physical object properties requires the model to be interpretable [3, 4, 12, 23, 25]. That is, the model should not only predict the future, but its latent representations should also indicate the physical properties (e.g., mass, friction and speed), implicitly or explicitly. For example, [3] proposed an Interaction Network which learns to predict the rigid-body dynamics of gravitational systems, and [25] proposed to explicitly estimate the physical states of objects and forward them to a physics engine for prediction. However, we argue that estimating the exact values of these physical properties may not be possible due to the entanglement of various factors. Instead of estimating physical states explicitly, our work focuses on separating the dimensions in the bottleneck layer.

Our work is most closely related to the Inverse Graphics Network [12], which learns a disentangled representation in its graphics code layer, where different neurons are encouraged to represent different transformations such as pose and light. The system can be trained end-to-end without explicit state values as supervision for the graphics code layer. However, unlike the Inverse Graphics Network, where pose and light can be inferred separately from the input images, in our setting the dynamics depend on the joint set of physical properties (mass, friction and speed), which confound future predictions.

Our model is also related to visual prediction models [11, 15, 19, 20, 22, 27, 30] in computer vision. For example, [20] proposed to directly predict a sequence of future video frames in raw pixels given a sequence of input frames. Instead of directly predicting pixels, [22] proposed to predict optical flow from an input image and then warp the flow onto the input image to generate the future frame. However, the optical flow estimation is not always correct, which introduces errors into the supervision used for training. To tackle this, [30] proposed a bilinear sampling layer that makes the warping process differentiable, enabling end-to-end, pixels-to-pixels training of the prediction model.

Fig. 2.

Our dataset includes two-object collisions with a variety of shapes. Unlike existing physics datasets, which contain only one type of shape, our dataset is diverse in terms of both object shapes and physical properties.

3 Dataset

We create a new dataset for the experiments in this paper. Its advantage is that it contains rich combinations of physical properties as well as diverse object appearances across different types of collisions (falling over, twisting, bouncing, etc.). Unlike previous datasets, the physical properties in our dataset are independent of object shape and appearance. This lets us train models that are forced to estimate physical properties by observing the collisions. More importantly, our testing sets contain combinations of object shapes or physical properties that are never seen in the training set. The details of dataset generation are described below.

We generate our data using the Unreal Engine 4 (UE4) game engine. We use 11 different object combinations with 5 unique basic objects: sphere, cube, cylinder, cone, and wedge. We vary 3 different physical properties: the mass of the static object, the initial speed of the colliding object, and the friction of the floor. For each property, we choose 5 different scales of values as shown in Table 1. For simplicity, we specify a certain scale of a parameter in the format {parameter name}\(_{\{scale\}}\) (e.g., mass\(_1\), friction\(_4\), speed\(_2\)). We simulate all \(5 \times 5 \times 5 = 125\) sets of physical combinations. For each set of physical properties, there are 11 different object combinations and 15 different initial rotations and restitutions. Thus, in total there are \(125\,\times \,15\,\times \,11=20625\) collisions. Each collision is represented by 5 sampled frames with 0.5 s intervals between them.
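To make the combinatorics above concrete, the sketch below enumerates the parameter grid described in this section. This is purely illustrative: the identifiers and the use of Python are our own, not the actual UE4 generation code.

```python
from itertools import product

# Illustrative sketch of the dataset parameter grid (hypothetical names).
SCALES = [1, 2, 3, 4, 5]        # 5 scales per physical property (Table 1)
N_SHAPE_PAIRS = 11              # object combinations
N_VARIATIONS = 15               # initial rotations / restitutions

configs = [
    {"mass": m, "speed": s, "friction": f, "shapes": c, "variation": v}
    for m, s, f in product(SCALES, repeat=3)   # 125 physics combinations
    for c in range(N_SHAPE_PAIRS)
    for v in range(N_VARIATIONS)
]
assert len(configs) == 125 * 11 * 15 == 20625  # total collisions
```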

The diversity of our dataset is highlighted in Fig. 2: for example, it has cones toppling over, cylinders falling down when hit by a ball, and rolling cylinders. We believe this large diversity makes it one of the most challenging datasets for learning and disentangling physical properties.

For training, we use 124 sets of physics combinations with 9 different object combinations (16740 collisions). The remaining data are used for two types of testing: (i) parameter testing and (ii) shape testing. The parameter testing set contains 135 collisions with an unseen physical parameter combination (mass\(_3\), speed\(_3\), friction\(_3\)) but seen object shape combinations. The shape testing set, on the other hand, contains 3750 collisions with 2 unseen shape combinations but seen physical parameter combinations. We show the generalization ability of our physics model under both testing conditions.

Table 1. Dataset settings

4 Interpretable Physics Model

Our goal is to develop a physics-based reasoning network for prediction tasks, e.g., physical collisions, while having interpretable intermediate representations.

4.1 Visual Prediction Model

As illustrated in Fig. 3, our model takes 4 RGB video frames as input and learns to predict the future 5th RGB frame after the collision. The model is composed of two parts: an encoder for extracting abstract physical representations and a decoder for future frame prediction.

Encoder for Physics Representations. The encoder is designed to capture the motion of the two colliding objects, from which the physical properties can be inferred. Given 4 RGB frames as input, we first forward them to a ConvNet with the AlexNet architecture, pre-trained on ImageNet. We extract the pool5 feature for each frame and concatenate the features as the representation of the input sequence. This feature is then forwarded to two convolutional layers and four fully connected layers to obtain the physics representation.

The physics representation is a 306-dimensional vector containing disentangled neurons for mass (dimensions 1 to 25), speed (dimensions 26 to 50), friction (dimensions 51 to 75), and other intrinsic information (dimensions 76 to 306), as shown in Fig. 3. Note that although the vector is disentangled, individual neuron values carry no explicit physical meaning.
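For concreteness, the fragment below sketches how such a representation could be indexed, using the dimension ranges from the text (0-based slices; the variable names are ours):

```python
import torch

z = torch.randn(1, 306)   # physics representation from the encoder (batch of 1)
phi_m = z[:, 0:25]        # mass dims (1-25 in the text's 1-based indexing)
phi_s = z[:, 25:50]       # speed dims (26-50)
phi_f = z[:, 50:75]       # friction dims (51-75)
phi_i = z[:, 75:306]      # intrinsic dims (76-306)
```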

Decoder for Future Prediction. The physics representation is forwarded to a decoder for future frame prediction. Our decoder contains one fully-connected layer followed by six deconvolutional layers. Inspired by [22, 30], our decoder outputs an optical flow field instead of directly outputting raw RGB pixel values. The optical flow is then used to warp the last input frame via a bilinear sampling layer [30] to generate the future frame. Since the bilinear sampling layer is differentiable, the network can be trained end-to-end with the 5th frame as direct supervision.

There are two major advantages of using optical flow as the output: (i) it forces the model to learn the factors that cause the changes between two frames; (ii) it allows the model to focus on the changes of the foreground objects.
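A minimal sketch of the flow-based warping step is given below, in the spirit of the bilinear sampling layer of [30]. The tensor layout and the helper name are our assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    """Warp the last input frame with the predicted flow (sketch).

    frame: (B, 3, H, W) RGB frame; flow: (B, 2, H, W) pixel offsets.
    """
    b, _, h, w = frame.shape
    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel offsets to normalized offsets and sample bilinearly.
    offsets = torch.stack((flow[:, 0] * 2 / (w - 1),
                           flow[:, 1] * 2 / (h - 1)), dim=-1)
    return F.grid_sample(frame, grid + offsets, align_corners=True)
```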

Fig. 3.

Model architecture: we follow an encoder-decoder framework. The encoder takes 4 frames of a collision (2 before the collision, 1 during, and 1 after). All inputs are first passed through a pre-trained AlexNet. The AlexNet features are concatenated along channels and sent to two convolution layers and four fully-connected layers. The resulting physics vector is passed through a decoder consisting of one fully-connected layer and six up-sampling convolution layers to produce an optical flow. The number on each convolution and transposed convolution layer is its kernel size. The final bilinear grid sampling layer takes the optical flow and the \(4^{th}\) input frame to produce the future prediction.

4.2 Learning Objective

Formally, we define the encoder as a function f and the decoder as a function g. Given an image sequence x as inputs (4 frames), our encoder transforms the images into a physically meaningful and disentangled representation \(z = f(x)\) and then the decoder transforms this representation into a future frame \(y = g(z)\).

The disentangled representation z can be formulated as \(z = (\phi ^{m}, \phi ^{s}, \phi ^{f}, \phi ^{i})\), where \((\cdot , \cdot )\) denotes concatenation. The first part, \((\phi ^{m}, \phi ^{s}, \phi ^{f})\), is the physics variable, which encodes the physical quantities (m, s, and f stand for mass, speed, and friction, respectively). The second part, \(\phi ^{i}\), is the intrinsic variable, representing all the other intrinsic properties of the scene (e.g., colors, shapes and initial rotation).

In this paper, we study the effect of varying the values of physical quantities in a two-object collision scenario. Following the strategy in [12], we group our training sequences into mini-batches. Inside one mini-batch, only one physical property changes across the samples while the other physical properties remain fixed. We denote by \(B^{p}=\{(x_k, y_k)\}_{k=1}^{5}\) one mini-batch with 5 sequences, where the only changing property is p (i.e., p is a variable representing either mass, speed or friction).

For each mini-batch \(B^p\) during training, we encourage only the dimensions of z corresponding to the property p to change. For example, when training with a mini-batch where only mass changes, we force the network to produce different values in the dimensions for \(\phi ^{m}\) and the same values in the rest of the dimensions of z. For simplicity, we denote the dimensions of z relevant to p as \(\phi ^{p}_k\) and the rest of the dimensions as \(\bar{\phi ^{p}_k}\) for example k.

We train our prediction model under this constraint. Assume we are training with one mini-batch \(B^{p}=\{(x_k, y_k)\}_{k=1}^{5}\). In a maximum likelihood estimation (MLE) framework, this can be formulated as maximizing the log-probabilities under the desired constraints:

$$\begin{aligned} \max \sum _{k} \log p(y_k \,|\, x_k) \quad \mathrm {s.t.} \quad \bar{\phi ^{p}_1} = \bar{\phi ^{p}_2} = \dots = \bar{\phi ^{p}_5}, \end{aligned}$$
(1)

where \(\bar{\phi ^{p}_k}\) contains both the intrinsic variable inferred from image sequence \(x_k\) and the inferred physics variables other than the changing property p.

In our auto-encoder architecture, the objective function is equivalent to minimizing the \(l_1\) distance between the predicted images \(\hat{y}_{k}\) and the ground-truth future images \(y_{k}\):

$$\begin{aligned} \mathcal {L}_{mle} = \sum _{k} ||\hat{y}_{k} - y_{k}||_1. \end{aligned}$$
(2)

The constraints in Eq. 1 can be satisfied by minimizing the distance between each \(\bar{\phi ^{p}_k}\) and their mini-batch mean \(\bar{\phi ^{p}} = \frac{1}{5} \sum _k \bar{\phi ^{p}_k}\):

$$\begin{aligned} \mathcal {L}_{ave} = \sum _{k} ||\bar{\phi ^{p}_k} - \bar{\phi ^{p}}||_2^2. \end{aligned}$$
(3)

We apply both losses jointly during training, with a constant \(\lambda \) balancing them:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_{mle} + \lambda \mathcal {L}_{ave}. \end{aligned}$$
(4)

In practice, we set \(\lambda \) dynamically so that both gradients are maintained at the same magnitude; its value is around \(10^{-6}\).
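A sketch of the combined objective, Eqs. 2-4, could look as follows. The variable names are ours, and the dynamic setting of \(\lambda \) is simplified to a constant:

```python
import torch

def combined_loss(y_hat, y, z_fixed, lam=1e-6):
    """Joint loss of Eqs. 2-4 for one mini-batch B^p (illustrative sketch).

    y_hat, y: predicted / ground-truth 5th frames, shape (5, 3, H, W).
    z_fixed:  the dimensions of z that should NOT change in B^p, shape (5, D).
    """
    l_mle = (y_hat - y).abs().sum()                 # Eq. 2: l1 image loss
    z_mean = z_fixed.mean(dim=0, keepdim=True)      # mini-batch mean
    l_ave = ((z_fixed - z_mean) ** 2).sum()         # Eq. 3: constancy penalty
    return l_mle + lam * l_ave                      # Eq. 4
```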

4.3 Staggered Training

Although we follow the training objective proposed in [12], directly optimizing this objective is non-trivial. There is a fundamental difference between our problem and the setting in [12]: in our model, the physical dynamics depend on the joint set of properties, which confounds training. The same input sequence and output ground truth might correspond to different combinations of physical properties; for example, both large friction and slow speed can lead to small movements of the second object after the collision. Thus, modifications to the training method are required to handle this multi-modality issue.

We propose a staggered training algorithm to alleviate this problem. We first divide the entire training set D into 3 subsets \(\{D^{p}\}\), where p indicates one of the physical properties (mass, speed or friction). Each \(D^p\) contains mini-batches \(B^p\) inside which the only changing property is p.

The idea is that instead of training with all the physical properties from the beginning, we perform curriculum learning: we first train the network with one subset \(D^p\) and then progressively add subsets with different properties into training, as sketched below. In this way, the training set grows over time. By learning the physical properties sequentially, we force the network to recognize new properties one by one while retaining the ones already learned. In practice, we observe that the network behaves normally during the first training session; in the following sessions, the loss increases at first and then decreases to roughly the same level as in the previous session.
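The staggered schedule can be summarized as follows. The function, its arguments, and the `model.predict`/`model.encode` interface are hypothetical placeholders; it assumes a `combined_loss`-style objective as above:

```python
def staggered_training(model, optimizer, subsets, epochs_per_stage=1):
    """Curriculum over the property subsets D^p (illustrative sketch).

    subsets: dict mapping property name -> iterable of mini-batches B^p,
             where each B^p is (x, y, fixed_dims) and varies only that property.
    """
    active = []
    for p in ["mass", "speed", "friction"]:     # stage order is a design choice
        active.append(p)                        # progressively grow the data
        for _ in range(epochs_per_stage):
            for prop in active:
                for x, y, fixed_dims in subsets[prop]:
                    loss = combined_loss(model.predict(x), y,
                                         model.encode(x)[:, fixed_dims])
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
```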

5 Experiments

We now demonstrate the effectiveness and generalization of our model. We will perform two sets of experiments with respect to two different testing sets in our dataset. One tests on unseen physical property combinations but seen shape combinations, and the other tests on unseen shape combinations with seen physical properties. Before going into further analysis, we will first describe the implementation details of our model and the baseline method.

Implementation Details. In total, we trained for 319 epochs. We used ADAM for optimization, with an initial learning rate of \(10^{-6}\). During training, each mini-batch mentioned above has 5 sequences. During training for the first physical quantity, each batch contains 3 mini-batches, i.e., 15 sequences in total. In the second round of staggered training, each batch contains 2 mini-batches, one for each of the two active physical quantities; similarly, in the third round, each batch contains 3 mini-batches, one for each physical quantity.

Baseline Model. Our baseline model learns intuitive physics in an end-to-end manner and obtains the dimensions corresponding to different physical properties post hoc. We need the disentangled representation because we want to test generalization when the physical properties differ from the input video: e.g., what happens if the friction is doubled? What happens if the speed is 1/10th?

For the baseline, we use the same network architecture but, unlike our approach, add no constraints on the bottleneck representation layer (Eq. 1). However, we still want to obtain a disentangled representation from this baseline for comparison. Recall that we have a subset \(D^p\) for each property p (mass, friction or speed), and the examples in each mini-batch inside \(D^p\) vary only property p. We compute the variance of each neuron in the bottleneck representation over each \(D^p\) and select the 25 dimensions with the highest variance as the vector indicating property p.
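The post-hoc selection could be sketched as follows (our names; here the variance is taken over all bottleneck vectors collected for each subset, which is one reasonable reading of the procedure above):

```python
import torch

def select_property_dims(z_by_subset, k=25):
    """Pick the k highest-variance bottleneck dims per property (sketch).

    z_by_subset: dict mapping property p -> tensor (N, 306) of bottleneck
                 vectors collected over the mini-batches of D^p.
    """
    return {p: torch.topk(z.var(dim=0), k).indices
            for p, z in z_by_subset.items()}
```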

5.1 Visual Prediction

Unseen Parameters. First, we evaluate whether we can predict future pixels for a novel combination of physical parameters. Specifically, our model has never seen the combination mass\(_3\), friction\(_3\), speed\(_3\) during training. Figure 4 shows that our interpretable model generalizes well and produces high-quality predictions.

Unseen Shape Combinations. Next, we explore whether our visual prediction model generalizes to different shape combinations using two unseen sets: (a) cone and cuboid; (b) cuboid and sphere. To demonstrate that our model understands each of these physical properties, we show contrasted prediction results for two different values. For example, we use two different friction values (1 and 5) but the same mass and speed; comparing the two outputs should highlight how our approach understands the underlying friction values.

Fig. 4.

Prediction results for unseen parameters but seen shapes.

Fig. 5.

4 input frames, the predicted 5th frame, and the ground truth for collisions with unseen shape combinations. Contrast the predictions as one physical property changes. For example, to show that our approach understands these shapes, we predict for two different friction values in the first case (keeping mass and speed the same). The reduced motion in the second case shows that our approach understands the concept of friction.

Fig. 6.

Interpolation results for physical quantities with different values. Our interpolation results are shown with blue frames. Images with red frames in the last column are the interpolation results of the baseline when the physical quantity equals 4. (Color figure online)

As shown in Fig. 5, our predicted future frames are of high quality compared to the ground truth. Our model generalizes its physics reasoning to unseen objects and learns to output different collision results under different physical conditions. For example, in the second sequence, when the mass of the sphere is high (scale 5), our approach predicts that the sphere will not move and that the cube will instead bounce back. We also compare our approach to the baseline quantitatively: our approach has a pixel error of 87.3, while the baseline has a pixel error of 95.6. The results clearly indicate that our interpretable model generalizes better than an end-to-end model when test conditions are very different.

In addition to the baseline, we compare our model with two other methods based on optical flow. First, we trained another prediction network using the optical flow computed between the 4th and 5th frames as direct supervision, instead of the pixels of the 5th frame. At test time, we apply the predicted optical flow to the 4th frame to generate the future frame; the loss between this frame and the ground-truth 5th frame is 118.8. Second, we computed the 3 optical flows between the first 4 frames and fit a linear model to extrapolate the future optical flow. Applying this flow to the 4th frame and comparing the result to the ground-truth 5th frame gives an error of 292.5. These results show that our method achieves higher precision than using optical flow directly.

5.2 Physical Interpolation

To show that our model has actually learned physical properties, we perform a series of interpolations on the bottleneck representation.

Interpolating Physics Representation Within a Mini-Batch. We first show that the learned bottleneck layer is meaningful and smooth. To demonstrate this, we interpolate between different values of a physical property and compare the results with the ground truth. Taking mass as an example: given a mini-batch where only mass changes, we use the encoder to get the physics vector \(z_1 = (\phi ^m_1, \phi ^s_1, \phi ^f_1, \phi ^i_1)\) from the mass\(_1\) data and \(z_5 = (\phi ^m_5, \phi ^s_5, \phi ^f_5, \phi ^i_5)\) from the mass\(_5\) data. To estimate the physics vector for mass\(_i\), we interpolate a new mass variable \(\hat{\phi }^m_i = (1-\frac{i-1}{4}) \cdot \phi ^m_1 + \frac{i-1}{4} \cdot \phi ^m_5\) and use it to form a new physics vector \(\hat{z}_i = (\hat{\phi }^m_i, \phi ^s_1, \phi ^f_1, \phi ^i_1)\). We pass this vector to the decoder to predict the optical flow, which is warped onto the 4th image of sequence i via the bilinear sampling layer to generate the future frame.
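In code, this interpolation amounts to replacing only the mass sub-vector (a sketch with our names, operating on PyTorch tensors and using the 0-based slices from Sect. 4.1):

```python
def interpolate_mass(z1, z5, i):
    """Build z_hat_i for mass_i from the mass_1 and mass_5 codes (sketch).

    z1, z5: physics vectors of shape (306,); i in {1, ..., 5}.
    """
    w = (i - 1) / 4.0                 # weight: 0 at mass_1, 1 at mass_5
    z_hat = z1.clone()                # keep speed/friction/intrinsic dims
    z_hat[0:25] = (1 - w) * z1[0:25] + w * z5[0:25]   # interpolate mass dims
    return z_hat
```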

We perform the same set of experiments with the baseline model. Quantitatively, we evaluate the predictions using the sum of per-pixel mean squared errors, as shown in Table 2; our method is significantly better than the baseline. We also visualize the results in Fig. 6. Interestingly, our interpolation results are very close to the ground truth, whereas the baseline model fails easily when there is a dramatic change during interpolation.

We also trained another model which takes the physics parameters and the optical flows of the first 4 frames as inputs and predicts the future frame. This model performs much worse than ours in the interpolation test, as shown in Fig. 6. We believe a model based on ground-truth physics parameters focuses on classification rather than learning an intuitive physics model; in the interpolation experiments, it cannot separate the physics information from the optical-flow features.

These comparisons show that only by learning interpretable representations can we generate reasonable predictions after interpolation.

Table 2. Interpolation results. The numbers are pixel prediction errors
Fig. 7.

Prediction by learning doubling and tripling relations for different physical quantities. Top: results with unseen shapes. Bottom: results with unseen parameters.

Changing Physical Properties. In this experiment, we show that the physics variables learned by our model are interpretable by finding a mapping between different scales of the same physical property. Specifically, can we predict the future if the mass is doubled while all other physical conditions remain the same? For each physical quantity p, we train two networks, \(F^p_2\) and \(F^p_3\), which learn to double or triple the scale of that property. For example, we can map the physics representation of mass\(_1\) to that of mass\(_3\) using \(F^p_3\). The architecture of both \(F^p_2\) and \(F^p_3\) is a simple 2-layer fully connected network with 256 hidden neurons per layer. These networks are trained on the physics representations inferred by our encoder from the training data.
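A sketch of such a mapping network is shown below. Whether it acts on the 25-d property sub-vector or the full physics code, and the exact layer layout, are our assumptions:

```python
import torch.nn as nn

class RatioNet(nn.Module):
    """F^p_2 / F^p_3: maps a property code at scale s to scale 2s or 3s (sketch)."""

    def __init__(self, dim=25, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # 2 hidden layers, 256 units each
            nn.Linear(hidden, dim))

    def forward(self, phi_p):
        return self.net(phi_p)
```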

At test time, we apply a procedure similar to the previous experiment; the only difference is that, instead of interpolating between two representations, we use the fully connected network to generate the new representation. We again evaluate quantitative results by computing the mean squared error over pixels. As shown in Table 3, the performance gain over the baseline is even larger in this setting. Figure 7 shows our model's predictions when a physical property is scaled from 1 to 2 and 3, all of which are very close to the ground truth. This is further evidence that our physics representation is interpretable and generalizes significantly better.

Table 3. Ratio results: visual prediction when the underlying physical parameters are changed by a factor
Fig. 8.

Prediction when the physical property vector from one shape combination is applied to a different shape combination. The first row shows the switched result; the second row shows the prediction without switching; the third row shows the ground truth.

Switching Between Object Shapes. In the experiments above, we interpolate the physics representation and apply it to the same object shape combination. In this experiment, for a physical property p, we replace the corresponding variable \(\phi ^p\) of one collision with the variable from another collision with different objects but the same value of p. We visualize the results in Fig. 8, where the first row shows the predictions when we replace the current \(\phi ^p\) with one from another shape combination. The results are almost identical to the original prediction and the ground truth, which means that a physical variable of the same value can be transferred between different shape combinations. It also shows that the physics dimensions and the other dimensions are independent and can be recombined easily.

6 Conclusions

We demonstrated an interpretable intuitive physics model that generalizes across scenes with different underlying properties and object shapes. Most importantly, our model is able to predict the future when the physical environment changes. To achieve this, we proposed a model in which specific dimensions of the bottleneck layer correspond to different physical properties. However, physical properties are often dependent and entangled, so we introduced a training curriculum and a generalized loss function that were shown to outperform baseline approaches.