Context-aware 3D object anchoring for mobile robots
Introduction
Mobile robots need to create and maintain internal representations of their surroundings for planning and executing tasks involving elements in them. Traditional tasks like navigation or localization already have well-established solutions for building such representations, namely metric [1], [2], topological [3], [4], or hybrid maps [5], [6]. More sophisticated world models, called semantic maps [7], [8], [9], arose for dealing with higher-level tasks: they codify information gathered during the exploration of the environment, but also semantic knowledge (or meta-information) about the elements that can be found in the robot workspace, their properties, and their relations. Unlike in the case of metric and topological maps, the algorithms for building and exploiting these models are not yet well-defined, and there is still significant room to explore.
One of the most critical steps during the building of these maps is creating and maintaining the correspondence between the object percepts detected in the workspace and their conceptual representation in the world model. This is known as the anchoring problem: “We call anchoring the process of creating and maintaining the correspondence between symbols and sensor data that refer to the same physical objects. The anchoring problem is the problem of how to perform anchoring in an artificial system” [10, p. 86f]. Even if we move our discourse away from semantic maps, anchoring would still be necessary for any system pursuing plan-based robot control, where symbols are just object identifiers or labels used by the planner. Notice, however, the potential of semantic maps, where symbols are further linked to concepts codifying functionalities and relations.
The anchoring process links object percepts to their conceptually defined categories, e.g., spoon, knife, mug, etc. This linking is accomplished by an object recognition method, which assigns a category to the percept. The kind of recognition algorithms traditionally exploited for this aim are referred to as local object recognition methods within this paper, since they work by individually classifying the perceived percepts according to their local features, e.g., size, shape, appearance, etc. [11]. However, it is well known that local recognition methods are prone to provide ambiguous results [12], [13], [14], which can lead to wrong linkings in the anchoring process. This results in an incoherent world model and in failed task executions.
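To make the anchoring step concrete, the following minimal Python sketch shows how a percept classified by a local recognizer could be linked to a persistent symbolic anchor. All names and the toy classifier are hypothetical illustrations, not the authors' implementation:

```python
from dataclasses import dataclass
import itertools

@dataclass
class Percept:
    """A segmented sensor observation described by local features (size, shape, ...)."""
    features: dict

@dataclass
class Anchor:
    """Persistent link between a symbolic object ID and the percepts of one physical object."""
    symbol: str            # identifier usable by a planner, e.g. "mug-1"
    category: str          # category assigned by the local recognizer
    last_percept: Percept

_ids = itertools.count(1)  # running counter for fresh symbols

def local_recognizer(percept):
    # Stand-in for any local recognition method: classifies by local features only.
    return "mug" if percept.features.get("shape") == "cylindrical" else "unknown"

def acquire(percept):
    """Create a new anchor for a previously unseen percept (the 'acquire' case)."""
    category = local_recognizer(percept)
    return Anchor(symbol=f"{category}-{next(_ids)}", category=category, last_percept=percept)

anchor = acquire(Percept(features={"shape": "cylindrical", "height_m": 0.10}))
```

An ambiguous local classification at this point is exactly what propagates into a wrong symbol-percept link; the contextual refinement discussed below aims to correct such cases before the anchor is committed.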
In this work we contribute an anchoring system with a distinctive feature: it does not simply copy the classification results from a local object recognition system, but instead uses the relations between objects (their spatial context) to improve those classification results. This is motivated by the fact that objects rarely occur in independent configurations at identically distributed locations. Rather, there is a coherent structure inherent to most real-world scenes. For example, a longish object in front of a monitor has a high probability of being a keyboard, whereas an object with the same local appearance next to a bread knife is more likely a cutting board.
In cases where local appearance features are not sufficient, contextual features can disambiguate object appearance in object recognition tasks [14]. Jointly considering context-based object categorization and anchoring as parts of a context-aware anchoring system benefits both subproblems: Anchoring receives better and more stable object classification results, while context-based object categorization can use contextual relations with anchored objects that extend beyond the sensor aperture.
As an example of this, consider Fig. 1. Due to the position of the robot and the aperture of the RGB-D camera, only part of the table scene is visible in the current sensor frame, so the full context is not available. This poses a problem for most existing context-aware object recognition systems, which fall roughly into one of the following two categories. The first category is single-frame recognition systems, which recognize objects relying on single observations of the scene in the form of RGB, depth or RGB-D images [15], [16], [17], [18], [19], [20]. Regarding the exploitation of contextual information, single-frame systems are seriously limited by the sensor aperture and occlusions, given that they are able to observe only a portion of the objects and relations appearing in the whole scene. The second category comprises offline recognition systems, which register a number of observations prior to the recognition process in order to obtain a wider view of the scene [21], [22], [23], [24], [25], [26], [27], [28], [29]. This solves the problems caused by sensor aperture and occlusions; however, the need to finish recording the sensor data before running the object recognition process prevents online operation, which is a requirement for most plan-based robot control systems.
Our approach is to continually process single frames using a local object recognition method and integrate the object recognition results into a persistent probabilistic world model. We then use a Conditional Random Field (CRF) [30] to exploit contextual relations between objects in the current scene as well as relations with previously perceived objects from the world model to improve the recognition results.
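The effect of letting pairwise relations revise local classification scores can be illustrated with a toy two-node CRF. This is a hand-written sketch with invented potentials and categories (echoing the keyboard/cutting-board example above), not the model or parameters used in the paper:

```python
import itertools

# Unary potentials: local recognizer scores for two segmented objects.
unary = [
    {"keyboard": 0.4, "cutting_board": 0.6},   # ambiguous longish object
    {"monitor": 0.9, "bread_knife": 0.1},      # neighbouring object
]

# Pairwise potential: compatibility of category pairs under an "in front of" relation.
# Unlisted pairs default to a neutral factor of 1.0.
pairwise = {
    ("keyboard", "monitor"): 5.0,
    ("cutting_board", "bread_knife"): 5.0,
}

def map_assignment(unary, pairwise):
    """Exhaustive MAP inference over the joint label space (fine for tiny graphs)."""
    best, best_score = None, -1.0
    for labels in itertools.product(*(u.keys() for u in unary)):
        score = unary[0][labels[0]] * unary[1][labels[1]]
        score *= pairwise.get((labels[0], labels[1]), 1.0)
        if score > best_score:
            best, best_score = labels, score
    return best

labels = map_assignment(unary, pairwise)
```

Locally, the ambiguous object would be labeled cutting_board (0.6 > 0.4); the strong compatibility of keyboards with monitors flips the joint MAP assignment to (keyboard, monitor). Real CRF inference over larger scenes uses approximate methods rather than enumeration.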
This approach has the following advantages:
- Our system can exploit contextual relations with objects that are currently out of view while still being capable of online operation (meaning that the output of the system is updated as soon as new sensor data comes in).
- The world model allows us to consistently assign the same object ID to an object without requiring that the object be constantly tracked. By anchoring symbolic object IDs to the objects reported by the local object recognition method, the planner and plan executor can refer to an object by the same symbol even after the object has disappeared from view for a prolonged period.
- The proposed system can integrate any state-of-the-art local object recognition system that processes single frames and supplement it with additional context information, achieving a significant boost in classification accuracy at a low computational overhead.
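The per-frame loop described above (recognize, associate with the world model, refine with context) can be sketched with stub components. Every function and data layout here is a hypothetical stand-in for the system's actual modules, kept trivial so the control flow is visible:

```python
def local_recognition(frame):
    """Stub for any single-frame recognizer: returns percepts with per-category scores."""
    return [{"scores": {"mug": 0.7, "bowl": 0.3}, "position": (0.5, 0.2)}]

def anchoring(percepts, world_model):
    """Stub data association: percepts observed at (roughly) the same position are
    attached to the same anchor; otherwise a new anchor is created."""
    for p in percepts:
        key = tuple(round(c, 1) for c in p["position"])
        world_model.setdefault(key, {"observations": []})["observations"].append(p["scores"])
    return world_model

def contextual_refinement(world_model):
    """Stub refinement: averages accumulated scores per anchor. A real system would
    run CRF inference over relations between the anchors instead."""
    beliefs = {}
    for key, anchor in world_model.items():
        obs = anchor["observations"]
        beliefs[key] = {c: sum(o[c] for o in obs) / len(obs) for c in obs[0]}
    return beliefs

world_model = {}
for frame in range(3):                 # simulated stream of sensor frames
    percepts = local_recognition(frame)
    world_model = anchoring(percepts, world_model)
beliefs = contextual_refinement(world_model)
```

The key point is that the world model persists across frames, so refinement can draw on anchors that are no longer in the current sensor aperture.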
The next section relates our work to the state of the art. Section 3 describes the role of our system within the RACE project. In Section 4, we describe the proposed context-aware anchoring system, and experimental results are reported in Section 5. Finally, Section 6 contains the conclusions and possible future work.
State of the art
This section starts by discussing related works concerning the two traditional ways to exploit contextual relations for object recognition: based on (offline) full-scene processing (Section 2.1) or on single-frame processing (Section 2.2). Next, the relation between the presented work and the emerging field of Dense 3D Semantic Mapping is discussed (Section 2.3). Finally, a number of relevant works addressing anchoring and world modeling are reported (Section 2.4).
Anchoring in the RACE architecture
The anchoring system presented here has been successfully employed in the context of the RACE project [64]. In RACE, all high-level modules communicate via the so-called blackboard, which is implemented on top of an RDF database. The elements stored on the blackboard are called fluents, i.e., temporally valid ground facts of a Description Logic (DL) ontology. Fluents have both a start and finish time between which they are valid. Since the main objective of the RACE project was enabling a robot
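A fluent as described above can be captured by a small data structure. The following Python sketch (field and method names are illustrative, not the RACE blackboard API) represents a ground fact with its validity interval:

```python
from dataclasses import dataclass
import math

@dataclass
class Fluent:
    """Temporally valid ground fact, e.g. (mug-1, isOn, table-2)."""
    subject: str
    predicate: str
    obj: str
    start: float                 # time from which the fact holds
    finish: float = math.inf     # open-ended until the fact is finished

    def holds_at(self, t):
        # Valid in the half-open interval [start, finish).
        return self.start <= t < self.finish

f = Fluent("mug-1", "isOn", "table-2", start=10.0)
f.finish = 42.0   # e.g., the mug was picked up at t = 42
```

Keeping start and finish times explicit lets modules query what held at any past instant, rather than only the current state.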
Context-aware anchoring
The proposed context-aware anchoring system is a combination of: (i) a local object recognition method, (ii) an anchoring process, and (iii) a Conditional Random Field (see Fig. 3). The local object recognition method (Section 4.1)
System evaluation
To evaluate our system, we collected a dataset of 15 different scenes. After a description of the dataset and the cross validation scheme, we first present a separate evaluation of the object association part of the system (Section 5.3). This is followed by an evaluation of the complete system (Section 5.4).
Conclusions
This work has presented an anchoring system able to exploit the contextual relations of the objects in the scene to achieve more coherent and stable results, aiming to build suitable representations of the robot workspace. This is a critical aspect of such systems when running in robotic architectures, since a wrong linking between an object and its category would lead to failures, for example, during task planning or execution. To achieve that we rely on a probabilistic, context-based
Acknowledgments
This work is supported by the European projects RACE (FP7-ICT-2011-7, grant agreement number 287752) and MoveCare (H2020-ICT-2016-1, grant agreement number 732158), by the WISER project (reference DPI2017-84827-R) funded by the Spanish Government and financed by European Regional Development Fund (FEDER) funds, and by a postdoc contract from the I-PPIT-UMA program financed by the University of Málaga, Spain.
References (80)
- Towards a general theory of topological maps, Artificial Intelligence (2004).
- Learning metric-topological maps for indoor mobile robot navigation, Artificial Intelligence (1998).
- Subjective local maps for hybrid metric-topological SLAM, Robot. Auton. Syst. (2009).
- Semantic mapping for mobile robotics tasks: A survey, Robot. Auton. Syst. (2015).
- Living with robots: Interactive environmental knowledge acquisition, Robot. Auton. Syst. (2016).
- Model-based furniture recognition for building semantic object maps, Artificial Intelligence (2017).
- An introduction to the anchoring problem, Robot. Auton. Syst. (2003).
- The role of context in object recognition, Trends Cogn. Sci. (2007).
- Context based object categorization: A critical survey, Comput. Vis. Image Underst. (2010).
- Exploiting semantic knowledge for robot object recognition, Knowl.-Based Syst. (2015).
- A survey on learning approaches for undirected graphical models. Application to scene object recognition, Internat. J. Approx. Reason.
- Semantic world modeling using probabilistic multiple hypothesis anchoring, Robot. Auton. Syst.
- Building multiversal semantic maps for mobile robot operation, Knowl.-Based Syst.
- Avoiding non-independence in fMRI data analysis: Leave one subject out, NeuroImage.
- Sonar-based real-world mapping and navigation, IEEE J. Robot. Automat.
- Learning occupancy grid maps with forward sensor models, Auton. Robot.
- Bayesian inference in the space of topological maps, IEEE Trans. Robot.
- Local features and kernels for classification of texture and object categories: A comprehensive study, Int. J. Comput. Vision.
- An empirical study of context in object detection.
- RGB-(D) scene labeling: Features and algorithms.
- Semantic context modeling with maximal margin conditional random fields for automatic image annotation.
- Context-based vision system for place and object recognition.
- Putting objects in perspective, Int. J. Comput. Vision.
- Object categorization using co-occurrence, location and appearance.
- TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context, Int. J. Comput. Vision.
- Semantic labeling of 3D point clouds for indoor scenes.
- Contextually guided semantic labeling and search for three-dimensional point clouds, Int. J. Robot. Res.
- Using context to create semantic 3D models of indoor environments.
- Mesh based semantic modelling for indoor and outdoor scenes.
- Mobile robot object recognition through the synergy of probabilistic graphical models and semantic knowledge.
- Relational approaches for joint object classification and scene similarity measurement in indoor environments.
- Combining top-down spatial reasoning and bottom-up object class recognition for scene understanding.
- A comparison of qualitative and metric spatial relation models for scene understanding.
- Probabilistic Graphical Models: Principles and Techniques.
- Knowledge-based incremental Bayesian learning for object recognition.
- A large-scale hierarchical multi-view RGB-D object dataset.
- Convolutional-recursive deep learning for 3D object classification.
- Multimodal deep learning for robust RGB-D object recognition.
- SUN RGB-D: A RGB-D scene understanding benchmark suite.
- Indoor segmentation and support inference from RGBD images.
Martin Günther is a researcher at DFKI, the German Research Center for Artificial Intelligence. He earned his Diploma in Computer Science from Technical University Dresden, Germany, in 2008. From 2009 until 2015 he worked as a research associate at the Knowledge-Based Systems group at Osnabrück University, Germany. His research interests include 3D perception, semantic mapping, context-aware object tracking and anchoring, active perception and goal-directed action in autonomous robot control.
Jose-Raul Ruiz-Sarmiento received the B.Sc. in Computer Science Engineering from the University of Málaga in July 2009, obtained the M.Sc. in Mechatronics one year later, and completed his Ph.D. in November 2016. As part of that Ph.D., in 2014 he was a visiting researcher in the Knowledge-Based Systems Research Group at Osnabrück University, Germany. Since joining the MAPIR group in September 2008, he has been involved in different national and European projects, and has developed a number of open-source tools related to his research lines: object and room recognition, semantic mapping, and machine learning, all in the scope of robotics. His research activity has produced more than 20 publications.
Cipriano Galindo received the Ph.D. degree (2006) in Computer Science from the University of Málaga, Spain. Since 2009 he has been a full-time associate professor at the same university. In 2004–2005 he was at the Applied Autonomous Sensor Systems lab, Örebro University (Sweden), working on semantic maps and intelligent systems. His research focuses on service robotics, telepresence, and Quality of Life Technologies (QoLT); he is (co)author of 20 JCR papers, 38 international conference papers, and 2 books.
Javier Gonzalez-Jimenez received the B.S. degree in Electrical Engineering from the University of Seville in 1987. Then, he joined the Department of “Ingenieria de Sistemas y Automatica” at the University of Málaga in 1988 and received the Ph.D. from this University in 1993. In 1990–1991 he was at the Field Robotics Center, Robotics Institute, Carnegie Mellon University (USA) working on mobile robots as part of his Ph.D. Since 1996 he has been leading Spanish and European projects on mobile robotics and perception. Currently, he is the head of the MAPIR group and full professor at the University of Málaga. His research interests include mobile robot autonomous navigation, computer vision and robotic olfaction. In these fields he has published three books and more than 200 papers.
Joachim Hertzberg is a full professor for computer science at Osnabrück University, Germany, heading the Knowledge-Based Systems lab. Since 2011, he has also been head of the Osnabrück branch of the Robotics Innovation Center of the German Research Center for Artificial Intelligence (DFKI). He graduated in Computer Science (diploma, U. Bonn, 1982; Dr. rer. nat., U. Bonn, 1986; habilitation, U. Hamburg, 1995). Former affiliations include GMD and Fraunhofer AIS in Sankt Augustin. His areas of research are AI and mobile robotics, with contributions to action planning, plan-based robot control, sensor data interpretation, semantic mapping, reasoning about action, constraint-based reasoning, and various applications of these. In his research fields, he has been the PI in a number of national and European projects. At Osnabrück University, he served as the Dean of the School of Mathematics and Computer Science. Awards for his work include the EurAI fellowship, received in 2014.