Abstract
In computer vision and stereo depth estimation, little research has addressed obtaining high-accuracy disparity change maps from two-dimensional images. Such a map carries information that bridges optical flow and depth, which is desirable for numerous academic research problems and industrial applications such as navigation systems, driving assistance, and autonomous systems. We introduce STTR-3D, a 3D extension of the STereo TRansformer (STTR) that leverages transformers and an attention mechanism for stereo depth estimation. We further use the Scene Flow FlyingThings3D dataset, which publicly provides ground truth for disparity change, and apply (1) refinements that use an MLP over the relative position encoding and (2) a regression head with entropy-regularized optimal transport to obtain the disparity change map. The resulting model consistently outperforms the original on depth estimation tasks. Compared with existing supervised methods for stereo depth estimation, our technique handles disparity estimation and disparity change simultaneously in a single end-to-end network, and the added transformer yields improved performance with high precision on both problems.
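The abstract names entropy-regularized optimal transport as the basis of the regression head. As a rough illustration only, not the paper's implementation (the regularization strength `eps`, the uniform marginals, and the toy cost matrix are all assumptions), the Sinkhorn iteration that solves this kind of transport problem can be sketched as:

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (m, n) matching-cost matrix, e.g. between pixels along two
          epipolar lines. Returns an (m, n) soft transport plan whose
          marginals approach the uniform distributions a and b.
    """
    m, n = cost.shape
    K = np.exp(-cost / eps)        # Gibbs kernel; eps controls smoothness
    a = np.full(m, 1.0 / m)        # uniform source marginal (assumption)
    b = np.full(n, 1.0 / n)        # uniform target marginal (assumption)
    u = np.ones(m)
    v = np.ones(n)
    for _ in range(n_iters):       # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: low cost on the diagonal concentrates mass there.
cost = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
plan = sinkhorn(cost)
```

In a matching head of this style, the plan's rows act as soft assignment weights, so a sub-pixel estimate can be read off as an expectation over candidate positions rather than a hard argmax; the entropy term keeps the plan differentiable for end-to-end training.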
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62072053 and the National Natural Fund Joint Fund Project under Grant No. U21B2041.
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Yang, Q., Rakai, L., Sun, S., Song, H., Song, X., Akhtar, N. (2024). STTR-3D: Stereo Transformer 3D Network for Video-Based Disparity Change Estimation. In: Song, X., Feng, R., Chen, Y., Li, J., Min, G. (eds) Web and Big Data. APWeb-WAIM 2023. Lecture Notes in Computer Science, vol 14334. Springer, Singapore. https://doi.org/10.1007/978-981-97-2421-5_15
DOI: https://doi.org/10.1007/978-981-97-2421-5_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2420-8
Online ISBN: 978-981-97-2421-5
eBook Packages: Computer Science, Computer Science (R0)