Abstract
The efficient manipulation of randomly placed objects relies on the accurate estimation of their 6 DoF geometric configuration. In this paper we tackle this problem by following the intuitive idea that different objects, viewed from the same perspective, should share identical poses, and that these poses should be efficiently projected onto a well-defined and highly distinguishable subspace. We formulate this hypothesis by introducing pose manifolds that rely on a bunch-based structure, which incorporates unsupervised clustering of the extracted visual cues and encapsulates both the appearance and geometric properties of the objects. The resulting pose manifolds represent the displacements between the extracted bunch points and the two foci of an ellipse fitted to the members of the bunch-based structure. We post-process the established pose manifolds via \(l_1\)-norm minimization to build sparse, highly representative input vectors with strong discriminative capabilities. Whereas other approaches to robot grasping build high-dimensional input vectors, thereby increasing system complexity, our method establishes highly distinguishable manifolds of low dimensionality. This paper represents the first integrated research effort in formulating sparse pose manifolds, and the experimental results provide evidence of low generalization error, thus supporting our theoretical claims.
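As a concrete illustration of the \(l_1\)-norm post-processing step, the following minimal Python sketch codes a query descriptor as a sparse combination of training descriptors. The use of scikit-learn's Lasso as the \(l_1\) solver, and all data and names, are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch: sparse coding of a pose descriptor via l1 minimization.
# The dictionary, dimensions, and solver choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical low-dimensional pose-manifold descriptors, one column per
# training view (e.g., displacements between bunch points and ellipse foci).
n_features, n_train = 16, 200
D = rng.normal(size=(n_features, n_train))          # dictionary of training vectors
D /= np.linalg.norm(D, axis=0)                      # unit-norm atoms

x = D[:, 42] + 0.01 * rng.normal(size=n_features)   # noisy query descriptor

# l1-regularized least squares: min_w 0.5*||x - D w||^2 + alpha*||w||_1
coder = Lasso(alpha=0.05, max_iter=10_000)
coder.fit(D, x)
w = coder.coef_

print("non-zero coefficients:", np.count_nonzero(w))
print("dominant training view:", np.argmax(np.abs(w)))  # ideally index 42
```

The sparsity of \(w\) is what makes the coded vector highly discriminative: only the few training views whose poses resemble the query receive non-zero weight.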
Notes
Grasping a pair of pliers: http://www.youtube.com/watch?v=J_gpPu6ZYQQ; grasping a box-shaped object: http://www.youtube.com/watch?v=cOF0RdeJ6Zg; pose estimation of a car: http://www.youtube.com/watch?v=CSoFMk48DmM; pose estimation of a box-shaped object: http://www.youtube.com/watch?v=RTe6usXm9qs.
Acknowledgments
The authors would like to thank Nikolaos Metaxas-Mariatos for his help in conducting the experimental validation of the proposed method.
Appendix
1.1 Formulation of bunch-based architecture
Given an image of an object at a certain pose, we first extract the 2D locations of the \(\rho\) SIFT keypoints, denoted as \(\mathbf{v}_{\zeta}\in\mathbb{R}^2\). We then perform clustering over the locations of the extracted interest points (input vectors \(\mathbf{v}_{\zeta},\ \zeta=1,2,\dots,\rho\)) in order to account for the topological attributes of the object. Supposing there are \(\gamma\) clusters, denoted as \(\mathbf{b}_{k},\ k=1,2,\dots,\gamma\), we adopt the basic Bayesian rule that a vector \(\mathbf{v}_{\zeta}\) belongs to cluster \(\mathbf{b}_{k}\) if \(P(\mathbf{b}_{k}|\mathbf{v}_{\zeta})>P(\mathbf{b}_{j}|\mathbf{v}_{\zeta})\) for \(j,k=1,2,\dots,\gamma,\ j\ne k\). The expectation of the unknown parameters conditioned on the current estimates \(\boldsymbol{\Theta}(\tau)\) (\(\tau\) denotes the iteration step) and the training samples (E-step of the EM algorithm) is:
\[
Q\big(\boldsymbol{\Theta};\boldsymbol{\Theta}(\tau)\big)=\sum_{\zeta=1}^{\rho}\sum_{k=1}^{\gamma}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big)\ln\!\big(p(\mathbf{v}_{\zeta}|\mathbf{b}_{k};\boldsymbol{\theta}_{k})\,P_{k}\big),
\]
with \(\mathbf{P}=[P_1,\dots,P_\gamma]^T\) denoting the a priori probabilities of the respective clusters, \(\hat{\boldsymbol{\theta}}=[\boldsymbol{\theta}_1^T,\dots,\boldsymbol{\theta}_\gamma^T]^T\) collecting the parameter vectors \(\boldsymbol{\theta}_k\) of the \(k\)-th cluster, and \(\boldsymbol{\Theta}=[\hat{\boldsymbol{\theta}}^T,\mathbf{P}^T]^T\). According to the M-step of the EM algorithm, the parameters of the \(\gamma\) clusters in the respective subspace are estimated through the maximization of the expectation:
\[
\boldsymbol{\Theta}(\tau+1)=\arg\max_{\boldsymbol{\Theta}}\ Q\big(\boldsymbol{\Theta};\boldsymbol{\Theta}(\tau)\big),
\]
resulting in the responsibility-weighted update of each cluster's parameter vector:
\[
\boldsymbol{\theta}_{k}(\tau+1)=\frac{\sum_{\zeta=1}^{\rho}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big)\,\mathbf{v}_{\zeta}}{\sum_{\zeta=1}^{\rho}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big)},
\]
while maximization with respect to the a priori probabilities \(\mathbf{P}\) reduces to:
\[
\mathbf{P}(\tau+1)=\arg\max_{\mathbf{P}}\ \sum_{\zeta=1}^{\rho}\sum_{k=1}^{\gamma}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big)\ln P_{k}.
\]
It is apparent that the optimization of Eq. 8 with respect to \(\mathbf{P}\) is a constrained maximization problem, subject to \(P_k\ge 0,\ k=1,\dots,\gamma\), and \(\sum_{k=1}^{\gamma}P_k=1\). We recall Lagrangian theory, which states:
Given a function \(f(x)\) to be optimized subject to a set of constraints \(g_i(x)=0\), build the corresponding Lagrangian function as \(\mathcal{L}(x,\boldsymbol{\lambda})=f(x)-\sum_i\lambda_i g_i(x)\).
Following on from the above statement, we denote the Lagrangian function corresponding to Eq. 6 as:
\[
\mathcal{L}(\mathbf{P},\lambda)=\sum_{\zeta=1}^{\rho}\sum_{k=1}^{\gamma}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big)\ln P_{k}-\lambda\left(\sum_{k=1}^{\gamma}P_{k}-1\right).
\]
We obtain \(\lambda\) and \(P_k\) through:
\[
\frac{\partial\mathcal{L}}{\partial P_{k}}=0\ \Rightarrow\ P_{k}=\frac{1}{\lambda}\sum_{\zeta=1}^{\rho}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big),
\]
and enforcing the constraint \(\sum_{k=1}^{\gamma}P_{k}=1\) yields \(\lambda=\rho\), hence \(P_{k}=\frac{1}{\rho}\sum_{\zeta=1}^{\rho}P\big(\mathbf{b}_{k}|\mathbf{v}_{\zeta};\boldsymbol{\Theta}(\tau)\big)\).
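For concreteness, the following is a minimal numpy sketch of the E- and M-steps above, assuming isotropic 2D Gaussian clusters over the keypoint locations; the variable names mirror the text, but the fixed variance, initialization, and synthetic data are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of EM clustering of 2D keypoint locations under isotropic
# Gaussian clusters; v: keypoints, theta: cluster means, P: priors P_k.
import numpy as np

rng = np.random.default_rng(1)
v = rng.uniform(0, 100, size=(300, 2))               # rho = 300 keypoint locations
gamma, sigma2 = 4, 50.0                              # cluster count, fixed variance
theta = v[rng.choice(len(v), gamma, replace=False)]  # initial means (assumption)
P = np.full(gamma, 1.0 / gamma)                      # initial priors P_k

for _ in range(50):
    # E-step: responsibilities P(b_k | v_zeta; Theta(tau))
    d2 = ((v[:, None, :] - theta[None, :, :]) ** 2).sum(-1)   # (rho, gamma)
    resp = P * np.exp(-0.5 * d2 / sigma2)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted-mean update and the Lagrangian result
    # P_k = (1/rho) * sum_zeta P(b_k | v_zeta; Theta(tau))
    Nk = resp.sum(axis=0)
    theta = (resp.T @ v) / Nk[:, None]
    P = Nk / len(v)

# Hard assignment: v_zeta joins the cluster with maximal posterior.
labels = resp.argmax(axis=1)
```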
1.2 Training with noise
The performance of the proposed regressor-based 3D pose estimation module is improved by adding noise to the input vectors fed to the RBF kernel during training. In the following passage we present a slightly modified version of the Tikhonov regularization theorem, adjusted to the needs of our case. When the inputs contain no noise and the size \(t\) of the training dataset tends to infinity, the error function involving the joint distribution \(p(y_{\lambda},\mathbf{r})\) of the desired values \(y_{\lambda}\) for the network outputs \(g_{\lambda}\) assumes the form:
\[
E=\frac{1}{2}\sum_{\lambda}\iint\big(g_{\lambda}(\mathbf{r})-y_{\lambda}\big)^{2}\,p(y_{\lambda},\mathbf{r})\,dy_{\lambda}\,d\mathbf{r}.
\]
Let \(\boldsymbol{\eta}\) be a random vector describing the input noise, with probability distribution \(p(\boldsymbol{\eta})\). In most cases the noise distribution is chosen to have zero mean (\(\int\eta_{i}\,p(\boldsymbol{\eta})\,d\boldsymbol{\eta}=0\)) and to be uncorrelated across components (\(\int\eta_{i}\eta_{j}\,p(\boldsymbol{\eta})\,d\boldsymbol{\eta}=\sigma^{2}\delta_{ij}\)). When each input data point carries additive noise and is replicated infinitely many times, the error function over the expanded dataset can be written as:
\[
\tilde{E}=\frac{1}{2}\sum_{\lambda}\iiint\big(g_{\lambda}(\mathbf{r}+\boldsymbol{\eta})-y_{\lambda}\big)^{2}\,p(y_{\lambda},\mathbf{r})\,p(\boldsymbol{\eta})\,d\boldsymbol{\eta}\,dy_{\lambda}\,d\mathbf{r}.
\]
Expanding the network function as a Taylor series in powers of \(\boldsymbol{\eta}\) produces:
\[
g_{\lambda}(\mathbf{r}+\boldsymbol{\eta})=g_{\lambda}(\mathbf{r})+\sum_{i}\eta_{i}\,\frac{\partial g_{\lambda}}{\partial r_{i}}+\frac{1}{2}\sum_{i}\sum_{j}\eta_{i}\eta_{j}\,\frac{\partial^{2}g_{\lambda}}{\partial r_{i}\,\partial r_{j}}+\mathcal{O}(\boldsymbol{\eta}^{3}).
\]
By substituting the Taylor series expansion into the error function we obtain the form that governs Tikhonov regularization,
\[
\tilde{E}=E+\sigma^{2}E^{R},
\]
with the regularization term
\[
E^{R}=\frac{1}{2}\sum_{\lambda}\iint\sum_{i}\left[\left(\frac{\partial g_{\lambda}}{\partial r_{i}}\right)^{2}+\big(g_{\lambda}(\mathbf{r})-y_{\lambda}\big)\,\frac{\partial^{2}g_{\lambda}}{\partial r_{i}^{2}}\right]p(y_{\lambda},\mathbf{r})\,dy_{\lambda}\,d\mathbf{r}.
\]
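A minimal sketch of the noise-injection scheme described above, under stated assumptions: each training input is replicated with zero-mean Gaussian perturbations before fitting an RBF-kernel regressor. Scikit-learn's KernelRidge stands in for the authors' regressor, and the data, noise level, and replication factor are illustrative.

```python
# Sketch: training with input noise as implicit Tikhonov regularization.
# KernelRidge with an RBF kernel is a stand-in regressor (assumption).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(2)
R = rng.uniform(-1, 1, size=(200, 6))        # input vectors (pose manifolds)
y = np.sin(R).sum(axis=1)                    # stand-in pose target

sigma, copies = 0.05, 5                      # noise std, replicas per sample
R_noisy = np.concatenate([R + sigma * rng.normal(size=R.shape)
                          for _ in range(copies)])
y_noisy = np.tile(y, copies)

model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-3)
model.fit(R_noisy, y_noisy)                  # noisy replicas smooth the fit
print("train RMSE:", np.sqrt(np.mean((model.predict(R) - y) ** 2)))
```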
Cite this article
Kouskouridas, R., Charalampous, K. & Gasteratos, A. Sparse pose manifolds. Auton Robot 37, 191–207 (2014). https://doi.org/10.1007/s10514-014-9388-x