
Accurate structure from motion using consistent cluster merging


Abstract

The incremental Structure-from-Motion (SfM) approach is widely used for scene reconstruction because it is robust to outliers. However, it suffers from two major limitations: error accumulation and heavy time consumption. To alleviate these problems, we propose an effective and efficient redundant cluster merging approach. Unlike previous clustering methods, in which each cluster has only one overlapping adjacent cluster, each sub-cluster produced by our approach has several overlapping adjacent cluster candidates. Each candidate is then verified to determine whether it is suitable for merging; by selecting only correctly estimated clusters, cluster merging achieves more accurate results. The verification rests on the observation that correctly estimated clusters of the same scene produce a consistent point cloud and consistent extrinsic camera parameters for each shared image, which we formulate as two constraints. In addition, we introduce a feature matching consistency constraint to eliminate falsely matched feature pairs; the resulting gain in matching accuracy leads to better estimates within each cluster. Experiments on three public datasets show that our method outperforms state-of-the-art SfM approaches in terms of both efficiency and accuracy.




Acknowledgements

This research was supported in part by the Natural Science Foundation of Hunan Province (No. 2017JJ2252).

Author information


Corresponding author

Correspondence to Luming Liang.


Appendix A: Coordinate transformation

A.1 Method 1: Coordinate transformation between two clusters

Given two adjacent clusters that share a common image, denote this shared image as \( {I_{f}^{k}}\) in the first cluster and as \({I_{s}^{k}}\) in the second cluster. The 3D points estimated in the first cluster, transformed into the local coordinate system of image \( {I_{f}^{k}}\), satisfy

$$ {\textbf{X}}_{cam1} = {\textbf{R}}_{1} {\textbf{X}} + {\textbf{t}}_{1} , $$
(7)

where X is the 3D global coordinate of a point obtained in the first cluster, R1, t1 are the estimated extrinsic camera parameters corresponding to image \( {I_{f}^{k}}\), and Xcam1 is the transformed coordinate in the local coordinate system of image \( {I_{f}^{k}}\). Equation (7) can be rewritten as

$$ {\textbf{X}} = {\textbf{R}}_{1}^{- 1} ({\textbf{X}}_{cam1} - {\textbf{t}}_{1} ) = {\textbf{R}}_{1}^{- 1} {\textbf{X}}_{cam1} - {\textbf{R}}_{1}^{- 1} {\textbf{t}}_{1}. $$
(8)

Assume the local coordinate system of image \( {I_{s}^{k}}\) is set as the global coordinate system of the second cluster, and denote the 3D coordinates of points estimated in the second cluster as X2. Since image \( {I_{f}^{k}}\) and image \( {I_{s}^{k}}\) are the same image, X2 = Xcam1. From (8), we obtain

$$ {\textbf{X}} = {\textbf{R}}_{1}^{- 1} {\textbf{X}}_{2} - {\textbf{R}}_{1}^{- 1} {\textbf{t}}_{1} . $$
(9)

Given an image \( {I_{s}^{i}}\) in the second cluster, we now explain how to estimate the parameters R, t that transform the extrinsic camera parameters corresponding to \( {I_{s}^{i}}\) from the second cluster coordinate system into the first cluster coordinate system.

The 3D global coordinate of a point obtained in the first cluster, X, and its coordinate transformed into the \( {I_{s}^{i}}\) local coordinate system, Xcam2, satisfy

$$ {\textbf{X}}_{cam2} = {\textbf{RX}} + {\textbf{t}}, $$
(10)

where X is the 3D global coordinate of the point as defined in (9), and R, t are the extrinsic camera parameters corresponding to image \( {I_{s}^{i}}\) in the first cluster coordinate system.

Substituting (9) into (10), we obtain

$$ {\textbf{X}}_{cam2} = {\textbf{R}}({\textbf{R}}_{1}^{- 1} {\textbf{X}}_{2} - {\textbf{R}}_{1}^{- 1} {\textbf{t}}_{1} ) + {\textbf{t}} = {\textbf{RR}}_{1}^{- 1} {\textbf{X}}_{2} - {\textbf{RR}}_{1}^{- 1} {\textbf{t}}_{1} + {\textbf{t}}. $$
(11)

The 3D coordinate of this point transformed into the \( {I_{s}^{i}}\) local coordinate system can also be written as

$$ {\textbf{X}}_{cam2} = {\textbf{R}}_{2} {\textbf{X}}_{2} + {\textbf{t}}_{2} , $$
(12)

where R2,t2 are the estimated extrinsic camera parameters corresponding to \( {I_{s}^{i}}\) in the second cluster, and X2 is the estimated 3D global coordinate of the point in the second cluster.

Since the coordinates of the point transformed into the \( {I_{s}^{i}}\) local coordinate system are unchanged, (12) equals (11), which can be formulated as

$$ {\textbf{RR}}_{1}^{- 1} {\textbf{X}}_{2} - {\textbf{RR}}_{1}^{- 1} {\textbf{t}}_{1} + {\textbf{t}} = {\textbf{R}}_{2} {\textbf{X}}_{2} + {\textbf{t}}_{2} . $$
(13)

From (13), we obtain

$$ {\textbf{RR}}_{1}^{- 1} = {\textbf{R}}_{2} \Rightarrow {\textbf{R}} = {\textbf{R}}_{2} {\textbf{R}}_{1}. $$
(14)
$$ - {\textbf{RR}}_{1}^{- 1} {\textbf{t}}_{1} + {\textbf{t}} = {\textbf{t}}_{2} \Rightarrow {\textbf{t}} = {\textbf{RR}}_{1}^{- 1} {\textbf{t}}_{1} + {\textbf{t}}_{2} = {\textbf{R}}_{2} {\textbf{t}}_{1} + {\textbf{t}}_{2}. $$
(15)
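
To make the merging step concrete, the following Python sketch is a minimal illustration under stated assumptions, not the authors' implementation: it assumes numpy, world-to-camera extrinsics of the form X_cam = R X + t with 3×3 orthonormal rotation matrices, and that the global frame of the second cluster is the local frame of the shared image. It applies (14) and (15) to bring the second cluster's camera poses, and (9) to bring its points, into the first cluster's coordinate system. All function and variable names here are our own choices for illustration.

```python
import numpy as np

def merge_cluster_into_first(R1, t1, cluster2_poses, cluster2_points):
    """Transform the second cluster into the first cluster's frame.

    R1, t1          -- pose of the shared image in the first cluster
                       (world-to-camera: X_cam = R1 @ X + t1)
    cluster2_poses  -- list of (R2, t2) camera extrinsics estimated in the
                       second cluster, whose global frame is the shared
                       image's local frame
    cluster2_points -- (N, 3) array of 3D points X2 from the second cluster
    """
    # Cameras, per (14) and (15): R = R2 R1, t = R2 t1 + t2.
    merged_poses = [(R2 @ R1, R2 @ t1 + t2) for R2, t2 in cluster2_poses]
    # Points, per (9): X = R1^{-1} X2 - R1^{-1} t1. For an orthonormal
    # rotation the inverse is the transpose, so for row-stacked points
    # this becomes X = (X2 - t1) @ R1.
    merged_points = (cluster2_points - t1) @ R1
    return merged_poses, merged_points
```

Because extrinsic rotations are orthonormal, the matrix inverses in (8) and (9) reduce to transposes, which is both faster and numerically safer than a general matrix inverse.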

A.2 Method 2: Coordinate transformation in the new global coordinate system

According to the previous derivation, if the \( {I_{f}^{k}}\) local coordinate system is set as the global coordinate system of the first cluster, then R1 = I and t1 = 0. From equations (14) and (15), R and t reduce to R = R2 and t = t2. In other words, if the global coordinate system of each cluster is set to the local coordinate system of the same shared image, the extrinsic parameters of a camera estimated in the second cluster can be used directly as its extrinsic parameters in the first cluster's global coordinate system.

Given an image Ig in a cluster, we now describe how the estimated extrinsic parameters of any camera in the cluster are transformed into the new global coordinate system defined by the local coordinate system of Ig.

We denote the extrinsic camera parameters corresponding to image Ig as Rg and tg. The coordinate of a point X transformed into the Ig local coordinate system, denoted Xcam, satisfies

$$ {\textbf{X}}_{cam} = {\textbf{R}}_{g} {\textbf{X}} + {\textbf{t}}_{g}. $$
(16)

Let R, t be the extrinsic camera parameters of the Nth image after the transformation. Since the local coordinate system of image Ig is set as the global coordinate system after the transformation, the coordinate of the point in the Nth image local coordinate system satisfies

$$ {\textbf{RX}}_{cam} + {\textbf{t}} = {\textbf{R}}\left( {{\textbf{R}}_{g} {\textbf{X}} + {\textbf{t}}_{g} } \right) + {\textbf{t}} = {\textbf{RR}}_{g} {\textbf{X}} + {\textbf{Rt}}_{g} + {\textbf{t}}. $$
(17)

Since the coordinate of the point transformed into the Nth image local coordinate system is unchanged, we have

$$ {\textbf{RR}}_{g} {\textbf{X}} + {\textbf{Rt}}_{g} + {\textbf{t}} = {\textbf{R}}_{n} {\textbf{X}} + {\textbf{t}}_{n} , $$
(18)

where Rn and tn are the estimated extrinsic camera parameters corresponding to the Nth image before transformation.

According to (18), we have

$$ {\textbf{RR}}_{g} = {\textbf{R}}_{n} ,{\textbf{Rt}}_{g} + {\textbf{t}} = {\textbf{t}}_{n}. $$
(19)

From (19), we conclude that the extrinsic camera parameters corresponding to the Nth image, transformed into the new global coordinate system defined by the local coordinate system of Ig, satisfy

$$ {\textbf{R}} = {\textbf{R}}_{n} {\textbf{R}}_{g}^{- 1} ,{\textbf{t}} = {\textbf{t}}_{n} - {\textbf{Rt}}_{g}. $$
(20)

Thus, the coordinates of a point in the new global coordinate system are estimated as

$$ {\textbf{X}}^{\prime} = {\textbf{R}}_{g} {\textbf{X}} + {\textbf{t}}_{g}. $$
(21)
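
The re-anchoring in (20) and (21) can be sketched the same way. Again, this is a hypothetical numpy illustration under the same world-to-camera convention, with names of our own choosing: every camera pose and every point of a cluster is re-expressed in the local frame of a chosen image Ig.

```python
import numpy as np

def reanchor_to_image(Rg, tg, poses, points):
    """Re-express a cluster in the local frame of image I_g.

    Rg, tg -- extrinsics of I_g in the old global frame
    poses  -- list of (Rn, tn) camera extrinsics in the old global frame
    points -- (N, 3) array of 3D points in the old global frame
    """
    # Cameras, per (20): R = Rn Rg^{-1} = Rn Rg^T, t = tn - R tg.
    new_poses = []
    for Rn, tn in poses:
        R = Rn @ Rg.T
        new_poses.append((R, tn - R @ tg))
    # Points, per (21): X' = Rg X + tg, in row-stacked form.
    new_points = points @ Rg.T + tg
    return new_poses, new_points
```

A quick sanity check: feeding the pose (Rg, tg) itself through (20) yields R = I and t = 0, i.e. the new pose of Ig is the identity, which is exactly the condition R1 = I, t1 = 0 exploited at the start of this subsection.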


Cite this article

Chen, S., Liang, L. & Ouyang, J. Accurate structure from motion using consistent cluster merging. Multimed Tools Appl 81, 24913–24935 (2022). https://doi.org/10.1007/s11042-022-12202-w
