Abstract
Temporal prediction is an important function in autonomous driving (AD) systems, as it forecasts how the environment will change in the next few seconds. Humans have an inherent prediction capability that extrapolates a present scenario into the future. In this paper, we present a novel approach to look further into the future using a standard semantic segmentation representation and time series networks of varying architectures. An important property of our approach is its flexibility to predict an arbitrary time horizon into the future. We perform prediction in the semantic segmentation domain, where the inputs are semantic segmentation masks. We present extensive results and a discussion on different data dimensionalities that can prove beneficial for prediction on longer time horizons (up to \(2\,\textrm{s}\)). We also show results of our approach on two widely employed datasets in AD research, i.e., Cityscapes and BDD100K. We report two types of mIoU, evaluated both with self-generated ground truth labels (mIoU\(^\textrm{seg}\)) for both of our datasets and with actual ground truth labels (mIoU\(^\textrm{gt}\)) for a specific split of the Cityscapes dataset. Our method achieves \(57.12\%\) and \(83.95\%\) mIoU\(^\textrm{seg}\), respectively, on the validation splits of BDD100K and Cityscapes for short-term time horizon predictions (up to \(0.2\,\textrm{s}\) and \(0.06\,\textrm{s}\)), outperforming the current state of the art on Cityscapes by \(13.71\%\) absolute. For long-term predictions (up to \(2\,\textrm{s}\) and \(0.6\,\textrm{s}\)), we achieve \(37.96\%\) and \(63.65\%\) mIoU\(^\textrm{seg}\), respectively, for BDD100K and Cityscapes. Specifically, on the validation split of Cityscapes with perfect ground truth annotations, we achieve \(67.55\%\) and \(63.60\%\) mIoU\(^\textrm{gt}\), outperforming the current state of the art by \(1.45\%\) absolute and \(4.2\%\) absolute for time horizon predictions up to \(0.06\,\textrm{s}\) and \(0.18\,\textrm{s}\), respectively.
References
Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: Proceedings of ICIP, Melbourne, VIC, Australia, pp. 3464–3468, September 2016
Breitenstein, J., Termöhlen, J.A., Lipinski, D., Fingscheidt, T.: Systematization of corner cases for visual perception in automated driving. In: Proceedings of IV, Las Vegas, NV, USA, pp. 986–993, October 2020
Cordts, M., et al.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of CVPR, Las Vegas, NV, USA, pp. 3213–3223, June 2016
Duwek, H.C., Shalumov, A., Tsur, E.E.: Image reconstruction from neuromorphic event cameras using Laplacian-prediction and Poisson integration with spiking and artificial neural networks. In: Proceedings of CVPR - Workshops, pp. 1333–1341. Virtual, June 2021
Fingscheidt, T., Gottschalk, H., Houben, S. (eds.): Deep Neural Networks and Data for Automated Driving: Robustness, Uncertainty Quantification, and Insights Towards Safety. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-01233-4
Guen, V.L., Thome, N.: Disentangling physical dynamics from unknown factors for unsupervised video prediction. In: Proceedings of CVPR, pp. 11471–11481. Virtual Conference, June 2020
Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Comput. 9(8), 1735–1780 (1997)
Jin, X., et al.: Predicting scene parsing and motion dynamics in the future. In: Proceedings of NeurIPS, Long Beach, CA, USA, December 2017
Kalman, R.E.: A new approach to linear filtering and prediction problems. Trans. ASME-J. Basic Eng. 82(Series D), 35–45 (1960)
Kwon, Y.H., Park, M.G.: Predicting future frames using retrospective cycle GAN. In: Proceedings of CVPR, Long Beach, CA, USA, pp. 1811–1820, June 2019
Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: Proceedings of ICCV, Venice, Italy, pp. 4463–4471, October 2017
Lotter, W., Kreiman, G., Cox, D.D.: Deep predictive coding networks for video prediction and unsupervised learning. arXiv, August 2016
Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: Proceedings of ICCV, Venice, Italy, pp. 648–657, October 2017
Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proceedings of IJCAI, Vancouver, BC, Canada, pp. 674–679, August 1981
Maas, A., Hannun, A., Ng, A.: Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of ICML, Atlanta, GA, USA, June 2013
Mahjourian, R., Wicke, M., Angelova, A.: Geometry-based next frame prediction from monocular video. In: Proceedings of IV, pp. 1700–1707, June 2017
Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: Proceedings of ICLR, San Juan, Puerto Rico, pp. 1–14, May 2016
Nabavi, S.S., Rochan, M., Wang, Y.: Future semantic segmentation with convolutional LSTM. In: Proceedings of BMVC, Newcastle, UK, pp. 1–12, September 2018
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of CVPR, Las Vegas, NV, USA, pp. 779–788, June 2016
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: Proceedings of CVPR, Honolulu, HI, USA, July 2017
Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of NIPS, Montreal, QC, Canada, pp. 802–810, December 2015
Walker, J., Razavi, A., van den Oord, A.: Predicting video with VQVAE. CoRR abs/2103.01950 (2021). https://arxiv.org/abs/2103.01950
Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 107–122. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_7
Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: simple and efficient design for semantic segmentation with transformers. In: Proceedings of NeurIPS, pp. 12077–12090. Virtual Conference, December 2021
Yu, F., et al.: BDD100K: a diverse driving dataset for heterogeneous multitask learning. In: Proceedings of CVPR, Seattle, WA, USA, pp. 1–14, June 2020
Zhao, H., Zhang, S., Wu, G., Moura, J.M.F., Costeira, J.P., Gordon, G.J.: Adversarial multiple source domain adaptation. In: Proceedings of NeurIPS, Montréal, QC, Canada, pp. 8568–8579, December 2018
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: Proceedings of CVPR, Honolulu, HI, USA, pp. 2881–2890, July 2017
Zhao, H., et al.: PSANet: point-wise spatial attention network for scene parsing. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11213, pp. 270–286. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01240-3_17
Ethics declarations
Disclaimer
The results, opinions and conclusions expressed in this publication are not necessarily those of Volkswagen Aktiengesellschaft.
A Supplementary Material
A.1 Qualitative Results
In this section, we show qualitative results of our method on sequences from BDD100K, \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\), and Cityscapes, \(\mathcal {D}^\mathrm {CS-vid}_\textrm{val}\). In Fig. 5, we show qualitative results of our prediction method on a sequence of \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\). We observe that, with increasing time steps, i.e., \(\varDelta {t}=\{1, 2, 5, 10\}\), the prediction worsens for dynamic objects. This can be inferred from the growing white regions in the absolute difference visualizations (bottom row), defined as \(\hat{\textbf{d}}_{t}=|\hat{\textbf{m}}_{t}-\overline{\textbf{m}}_{t}|\). Moreover, the majority of the prediction errors occur at the boundaries between classes, e.g., where car pixels meet road pixels, or where sidewalk pixels meet road pixels.
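The difference maps in the bottom rows of Figs. 5 and 6 follow directly from this definition. The following is a minimal NumPy sketch (with our own illustrative naming, not our actual implementation), assuming the predicted and (pseudo) ground truth masks are stored as integer class-index arrays of shape \((H, W)\):

```python
import numpy as np

def difference_map(m_pred: np.ndarray, m_gt: np.ndarray) -> np.ndarray:
    """Per-pixel absolute difference d_t = |m_hat_t - m_bar_t| between the
    predicted and (pseudo) ground truth class-index masks of shape (H, W)."""
    return np.abs(m_pred.astype(np.int32) - m_gt.astype(np.int32))

def mismatch_map(m_pred: np.ndarray, m_gt: np.ndarray) -> np.ndarray:
    """Binary map that is 1 (rendered white in the figures) wherever the
    predicted class index differs from the ground truth class index."""
    return (m_pred != m_gt).astype(np.uint8)
```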
Specifically for the Cityscapes dataset, in Fig. 6, the predictions (middle row) are on par with their corresponding 20\(^\textrm{th}\) frame ground truth annotations (top row), obtained from the dataset, for \(\varDelta {t}=\{1, 2, 5\}\). The semantic class boundaries are well captured and noise is suppressed in the predictions (middle row). However, for \(\varDelta {t}=10\), the prediction focuses more on the static classes than on the finer boundary details of the dynamic classes, as is visible from the sidewalk (in pink) that is present in the ground truth annotation \(\overline{\textbf{m}}_{t+10}\) but missing in the left region of \(\hat{\textbf{m}}_{t+10}\) in Fig. 6.
Fig. 5. Output predictions for a sequence of the BDD100K validation split, \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\). The top row depicts the pseudo ground truth \(\overline{\textbf{m}}_{t}\), \(\overline{\textbf{m}}_{t+1}\), \(\overline{\textbf{m}}_{t+2}\), \(\overline{\textbf{m}}_{t+5}\), \(\overline{\textbf{m}}_{t+10}\) generated by PSANet. The middle row shows the input semantic segmentation \(\overline{\textbf{m}}_{t}\) along with the predictions \(\hat{\textbf{m}}_{t+1}\), \(\hat{\textbf{m}}_{t+2}\), \(\hat{\textbf{m}}_{t+5}\), \(\hat{\textbf{m}}_{t+10}\) of the prediction network. The bottom row portrays the absolute differences \(\hat{\textbf{d}}_{t+1}\), \(\hat{\textbf{d}}_{t+2}\), \(\hat{\textbf{d}}_{t+5}\), \(\hat{\textbf{d}}_{t+10}\) between the pseudo ground truth and the predicted frames.
Fig. 6. Output predictions for a sequence of the Cityscapes validation split, \(\mathcal {D}^\mathrm {CS-vid}_\textrm{val}\). The top row depicts the actual 20\(^\textrm{th}\) frame ground truth annotations available in the dataset for \(\overline{\textbf{m}}_{t}\), \(\overline{\textbf{m}}_{t+1}\), \(\overline{\textbf{m}}_{t+2}\), \(\overline{\textbf{m}}_{t+5}\), \(\overline{\textbf{m}}_{t+10}\). The middle row shows the input semantic segmentation \(\overline{\textbf{m}}_{t}\) along with the predictions \(\hat{\textbf{m}}_{t+1}\), \(\hat{\textbf{m}}_{t+2}\), \(\hat{\textbf{m}}_{t+5}\), \(\hat{\textbf{m}}_{t+10}\) of the prediction network. The bottom row portrays the absolute differences \(\hat{\textbf{d}}_{t+1}\), \(\hat{\textbf{d}}_{t+2}\), \(\hat{\textbf{d}}_{t+5}\), \(\hat{\textbf{d}}_{t+10}\) between the ground truth and the predicted frames.
Fig. 7. A semantic segmentation input mask \(\overline{\textbf{m}}_{t}\) from \(\mathcal {D}^\mathrm {BDD-MOTS}_\textrm{val}\) showing different semantic classes. The left white encircled region portrays the proximal occurrence of class sidewalk (\(s=2\)) next to class road (\(s=1\)). Similarly, the right white encircled region portrays the proximal occurrence of class road (\(s=1\)) next to class car (\(s=14\)).
A.2 Prediction Invariance on the Ordering of Semantic Classes
We investigate the behavior of our model when 1-channel inputs are fed to our predictor network, i.e., the generated pseudo ground truth \(\overline{\textbf{m}}_{t} \in \mathcal {S}^{H \times W \times 1}\) for the BDD100K dataset \(\mathcal {D}^\mathrm {BDD-MOTS}\). The semantic segmentation mask \(\overline{\textbf{m}}_{t}\) contains class indices \(s \in \mathcal {S}=\{1, 2, ..., S\}\), where \(S=19\). Here, each semantic class corresponds to a fixed class index \(s\), e.g., \(s=1\) for class road and \(s=12\) for class person. If we consider a small region in a semantic segmentation mask, we usually find pixels of one semantic class in close proximity to pixels of another semantic class, e.g., class road pixels (\(s=1\)) almost always occur adjacent to class car pixels (\(s=14\)), and class sidewalk pixels (\(s=2\)) almost always occur adjacent to class road pixels (\(s=1\)), as can be seen in Fig. 7. To investigate the performance and robustness of the proposed method when the original class order is changed, we conducted experiments in which the class indices in the generated pseudo ground truth frames \(\overline{\textbf{m}}_{t}\) are shuffled. For instance, the same scene would then contain class road (\(s=5\)) adjacent to class car (\(s=16\)) and class sidewalk (\(s=9\)) adjacent to class road (\(s=5\)). Note that the semantic classes remain the same; only the class indices are shuffled randomly. Table 4 shows the original class order along with the re-ordered class indices, where the semantic classes are marked by their defined colors in \(\mathcal {D}^\mathrm {BDD-MOTS}\). This investigation is important to show that our predictor model learns the proximal relationship between the semantic classes instead of the numerical class indices, i.e., our model learns that class road pixels are most likely to occur near class car pixels and vice versa, irrespective of their class index values. Hence, we re-ordered the class indices of \(\mathcal {D}^\mathrm {BDD-MOTS}\) such that classes that were previously close in terms of class index distance, e.g., road (\(s=1\)) and sidewalk (\(s=2\)), are now placed further apart, e.g., road (\(s=5\)) and sidewalk (\(s=9\)), as can be seen in Table 4.
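A minimal sketch of such a class index re-ordering is given below, assuming the pseudo ground truth masks are integer arrays of shape \((H, W)\) with class indices in \(\{1, \ldots, 19\}\); the random permutation shown here is purely illustrative and does not reproduce the exact re-orderings of Table 4:

```python
import numpy as np

NUM_CLASSES = 19  # number of semantic classes, indices 1..19

# Illustrative random permutation of the class indices (the actual
# re-orderings of Table 4 differ). Entry 0 of the lookup table is unused,
# since class indices start at 1; we assume every pixel carries a valid
# class index in 1..19 (no ignore label).
rng = np.random.default_rng(seed=0)
lookup = np.zeros(NUM_CLASSES + 1, dtype=np.uint8)
lookup[1:] = rng.permutation(np.arange(1, NUM_CLASSES + 1))

def reorder_classes(mask: np.ndarray) -> np.ndarray:
    """Remap every class index in an (H, W) mask via the lookup table,
    i.e., old index s becomes lookup[s]. The semantic content of the mask
    is unchanged; only the numerical class indices differ."""
    return lookup[mask]
```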
Table 5 shows the confusion matrix for the original class order of \(\mathcal {D}^\mathrm {BDD-MOTS}\). The confusion matrix shows how each predicted class is confused with the classes present in the ground truth and vice versa. It can be observed that every class is predicted well, with the highest score on the diagonal, except class rider (\(s=13\)), which is predicted as class person (\(s=12\)) with a score of 0.49; this is plausible, as rider falls into the broader category of person. Similarly, class motorcycle (\(s=18\)) gets confused for class road (\(s=1\)) with a score of 0.33. This can be attributed to the fact that class road dominates the pixel distribution in all scenes, whereas class motorcycle occupies only very few pixels in most scenes. Table 6 shows the confusion matrix for the first re-ordering of classes. It can be observed that the predictor still confuses class rider (\(s=17\)) for class person (\(s=11\)) with a score of 0.34, and class motorcycle (\(s=19\)) for class road (\(s=5\)) with a score of 0.38. Similarly, Table 7 shows the confusion matrix for the second re-ordering of classes. Once again, the predictor interprets class rider (\(s=5\)) as class person (\(s=13\)) with a score of 0.37, and class motorcycle (\(s=11\)) as class road (\(s=8\)) with a score of 0.38. We can infer that the predictor behaves in the same way for all class predictions regardless of the class index ordering. Thus, the predictor can be considered invariant to the class ordering.
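For reference, row-normalized confusion matrices such as those in Tables 5, 6 and 7 can be computed as in the following sketch (function and variable names are ours, chosen for illustration), assuming prediction and ground truth masks with class indices in \(\{1, \ldots, S\}\) and rows normalized over the ground truth class occupancy:

```python
import numpy as np

def confusion_matrix(m_pred: np.ndarray, m_gt: np.ndarray,
                     num_classes: int = 19) -> np.ndarray:
    """Row-normalized confusion matrix C, where C[i, j] is the fraction of
    pixels of ground truth class i+1 that were predicted as class j+1."""
    gt = m_gt.reshape(-1).astype(np.int64) - 1    # shift indices to 0..S-1
    pred = m_pred.reshape(-1).astype(np.int64) - 1
    counts = np.bincount(gt * num_classes + pred,
                         minlength=num_classes * num_classes)
    counts = counts.reshape(num_classes, num_classes).astype(np.float64)
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1.0)     # guard against empty classes
```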
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dash, B., Bilagi, S., Breitenstein, J., Schomerus, V., Bagdonat, T., Fingscheidt, T. (2023). Deep Long Term Prediction for Semantic Segmentation in Autonomous Driving. In: Ifrim, G., et al. (eds.) Advanced Analytics and Learning on Temporal Data. AALTD 2023. Lecture Notes in Computer Science, vol. 14343. Springer, Cham. https://doi.org/10.1007/978-3-031-49896-1_7
Print ISBN: 978-3-031-49895-4
Online ISBN: 978-3-031-49896-1