
STTR-3D: Stereo Transformer 3D Network for Video-Based Disparity Change Estimation

  • Conference paper
  • In: Web and Big Data (APWeb-WAIM 2023)

Abstract

In the field of computer vision and stereo depth estimation, little research has addressed obtaining high-accuracy disparity change maps from two-dimensional images. Such a map carries information that bridges the gap between optical flow and depth, which is desirable for numerous academic research problems and industrial applications, such as navigation systems, driving assistance, and autonomous systems. We introduce STTR-3D, a 3D extension of the STereo TRansformer (STTR) that leverages transformers and an attention mechanism for stereo depth estimation. We further make use of the Scene Flow FlyingThings3D dataset, which openly includes disparity change data, and apply 1) refinements that use an MLP over the relative position encoding and 2) a regression head with entropy-regularized optimal transport to obtain the disparity change map. The model consistently outperforms the original on depth estimation tasks. Unlike existing supervised methods for stereo depth estimation, our technique handles both disparity estimation and disparity change in a single end-to-end network, and establishes that the added transformer yields improved performance, achieving high precision on both problems.
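The entropy-regularized optimal transport mentioned in the abstract is commonly computed with Sinkhorn iterations, which turn a pixel-matching cost matrix into a differentiable soft assignment. The sketch below is illustrative only, not the paper's implementation: it assumes uniform marginals, a single scanline pair, and a toy cost matrix whose true matching is the identity; the function names `sinkhorn` and `soft_disparity` are our own.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, n) matching cost between pixels of the left and right
          scanline; eps is the entropy weight (smaller -> sharper matching).
    Returns an (approximately) doubly stochastic soft-assignment matrix.
    """
    K = np.exp(-cost / eps)              # Gibbs kernel of the cost matrix
    u = np.ones(cost.shape[0])
    v = np.ones(cost.shape[1])
    for _ in range(n_iters):             # alternate row / column rescaling
        u = 1.0 / (K @ v)
        v = 1.0 / (K.T @ u)
    return u[:, None] * K * v[None, :]

def soft_disparity(assignment):
    """Differentiable disparity: row index minus expected matched column."""
    probs = assignment / assignment.sum(axis=1, keepdims=True)
    expected_col = probs @ np.arange(assignment.shape[1])
    return np.arange(assignment.shape[0]) - expected_col

# Toy scanline pair that is already aligned (true disparity = 0 everywhere),
# so the unique optimal matching is the identity permutation.
cost = np.abs(np.arange(8)[:, None] - np.arange(8)[None, :]).astype(float)
A = sinkhorn(cost)
disparity = soft_disparity(A)
```

Because every step is a smooth rescaling of the kernel, the whole readout is differentiable, which is what makes an optimal-transport regression head trainable end to end; as `eps` shrinks, the soft assignment approaches a hard matching.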



Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62072053 and the National Natural Fund Joint Fund Project under Grant No. U21B2041.

Author information


Corresponding author

Correspondence to Huansheng Song.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Yang, Q., Rakai, L., Sun, S., Song, H., Song, X., Akhtar, N. (2024). STTR-3D: Stereo Transformer 3D Network for Video-Based Disparity Change Estimation. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14334. Springer, Singapore. https://doi.org/10.1007/978-981-97-2421-5_15


  • DOI: https://doi.org/10.1007/978-981-97-2421-5_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2420-8

  • Online ISBN: 978-981-97-2421-5

  • eBook Packages: Computer Science, Computer Science (R0)
