Affordance-based robot object retrieval

Abstract

Natural language object retrieval is a highly useful yet challenging task for robots in human-centric environments. Previous work has primarily focused on commands specifying the desired object’s type such as “scissors” and/or visual attributes such as “red,” thus limiting the robot to only known object classes. We develop a model to retrieve objects based on descriptions of their usage. The model takes in a language command containing a verb, for example “Hand me something to cut,” and RGB images of candidate objects; and outputs the object that best satisfies the task specified by the verb. Our model directly predicts an object’s appearance from the object’s use specified by a verb phrase, without needing an object’s class label. Based on contextual information present in the language commands, our model can generalize to unseen object classes and unknown nouns in the commands. Our model correctly selects objects out of sets of five candidates to fulfill natural language commands, and achieves a mean reciprocal rank of 77.4% on a held-out test set of unseen ImageNet object classes and 69.1% on unseen object classes and unknown nouns. Our model also achieves a mean reciprocal rank of 71.8% on unseen YCB object classes, which have a different image distribution from ImageNet. We demonstrate our model on a KUKA LBR iiwa robot arm, enabling the robot to retrieve objects based on natural language descriptions of their usage (Video recordings of the robot demonstrations can be found at https://youtu.be/WMAdGhMmXEQ). We also present a new dataset of 655 verb-object pairs denoting object usage over 50 verbs and 216 object classes (The dataset and code for the project can be found at https://github.com/Thaonguyen3095/affordance-language).
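As a rough illustration of the retrieval setup described in the abstract (not the authors' implementation), the sketch below maps a verb-phrase embedding into image-feature space, ranks candidate objects by cosine similarity, and computes the mean reciprocal rank used in the evaluation. The embedding dimensions, the projection network, and the random placeholder vectors (standing in for GloVe-style word embeddings and CNN image features) are illustrative assumptions only.

```python
# Minimal sketch of affordance-based retrieval scoring (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

WORD_DIM = 300    # assumed GloVe-style word-embedding size
IMG_DIM = 2048    # assumed CNN image-feature size

class VerbToImageSpace(nn.Module):
    """Projects a verb-phrase embedding into the image-feature space."""
    def __init__(self, word_dim=WORD_DIM, img_dim=IMG_DIM, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(word_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, img_dim),
        )

    def forward(self, verb_embedding):
        return self.net(verb_embedding)

def rank_candidates(model, verb_embedding, candidate_image_features):
    """Return candidate indices ordered from best to worst match."""
    predicted = model(verb_embedding)                        # (IMG_DIM,)
    scores = F.cosine_similarity(predicted.unsqueeze(0),
                                 candidate_image_features)   # (num_candidates,)
    return torch.argsort(scores, descending=True)

def mean_reciprocal_rank(ranked_lists, correct_indices):
    """Average of 1 / (1-based rank of the correct object) over queries."""
    reciprocal_ranks = []
    for ranking, correct in zip(ranked_lists, correct_indices):
        rank = (ranking == correct).nonzero(as_tuple=True)[0].item() + 1
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = VerbToImageSpace()
    # Placeholder inputs: an embedding for the verb in "Hand me something to
    # cut" and image features for five candidate objects.
    verb_emb = torch.randn(WORD_DIM)
    candidates = torch.randn(5, IMG_DIM)
    ranking = rank_candidates(model, verb_emb, candidates)
    print("Ranking:", ranking.tolist())
    print("MRR (single query, correct index 2):",
          mean_reciprocal_rank([ranking], [2]))
```

In this sketch, predicting a point in image-feature space from the verb phrase and then scoring candidates by cosine similarity reflects the abstract's description of predicting an object's appearance from its use, without requiring an object class label.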

Notes

  1. https://youtu.be/WMAdGhMmXEQ.

  2. We did not use the COCO dataset as our training image set because it contains far fewer object classes than ImageNet's 1000 classes.

Acknowledgements

The authors would like to thank Prof. James Tompkin for advice on selecting the image dataset and encoder, and Eric Rosen for help with video editing. This work is supported by the National Science Foundation under award numbers IIS-1652561 and IIS-1717569, by NASA under award number NNX16AR61G, by the Hyundai NGV under the Hyundai-Brown Idea Incubation award, and by the Alfred P. Sloan Foundation.

Author information

Corresponding author

Correspondence to Thao Nguyen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This is one of several papers published in Autonomous Robots comprising the Special Issue on Robotics: Science and Systems 2020.

About this article

Cite this article

Nguyen, T., Gopalan, N., Patel, R. et al. Affordance-based robot object retrieval. Auton Robot 46, 83–98 (2022). https://doi.org/10.1007/s10514-021-10008-7
