Scenario Referring Expression Comprehension via Attributes of Vision and Language

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2019)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11859)

Abstract

Referring Expression Comprehension (REC) is the task of locating particular objects within an image from natural language expressions. Previous studies have assumed a one-to-one correspondence between the expression and the image: the target region that the expression refers to must exist in the current image, so the region with the highest score is returned whether or not it actually matches. In practical applications, however, REC must locate the referred target region within a sequence of scene images that may be matched, semi-matched, or mismatched with the expression. This paper refers to this 3D version of the challenge as Scenario Referring Expression Comprehension (SREC). To address it, we built a test set by enhancing an existing real-scenario dataset, constructed, for the first time, a Dual Attributes Recursive Retrieve Reasoning Model (DA3R) that uses the attributes of both images and expressions, and verified the feasibility of the method on the test set by evaluating it with three different types of enhanced expressions.
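The difference between the classic REC assumption and the scenario setting can be illustrated with a minimal sketch. This is not the paper's DA3R model: it assumes hypothetical per-region matching scores in [0, 1] that some scoring model has already produced, and the function names `rec_locate` and `srec_locate` and the rejection threshold of 0.5 are invented here purely for illustration.

```python
from typing import List, Optional, Tuple


def rec_locate(region_scores: List[float]) -> int:
    """Classic REC assumption: the target is in this image, so always
    return the index of the top-scoring region, however low the score."""
    return max(range(len(region_scores)), key=region_scores.__getitem__)


def srec_locate(
    scene_scores: List[List[float]],
    threshold: float = 0.5,  # hypothetical rejection threshold
) -> Optional[Tuple[int, int]]:
    """Scenario setting (illustrative only): search a sequence of images and
    return (image_index, region_index) of the best region whose score
    reaches the threshold, or None if no image contains a match."""
    best: Optional[Tuple[int, int]] = None
    best_score = float("-inf")
    for img_idx, scores in enumerate(scene_scores):
        for reg_idx, score in enumerate(scores):
            if score >= threshold and score > best_score:
                best, best_score = (img_idx, reg_idx), score
    return best


# Hypothetical scores: the first image does not contain the target at all.
mismatched_image = [0.10, 0.30, 0.20]
sequence = [mismatched_image, [0.40, 0.45], [0.20, 0.90, 0.60]]

forced_choice = rec_locate(mismatched_image)    # picks a region regardless
rejected = srec_locate([mismatched_image])      # no region is good enough
found = srec_locate(sequence)                   # best region in the sequence
```

With these made-up scores, the single-image locator still returns a region for the mismatched image, while the sequence-level locator rejects that image and selects a region from a later one. A fixed threshold is only one simple way to express rejection and is not claimed to be how the paper handles semi-matched or mismatched scenes.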

The first author of this paper is a student. This work was supported by the National Natural Science Foundation of China (No. 61373104, No. 61405143); the Excellent Science and Technology Enterprise Specialist Project of Tianjin (No. 18JCTPJC59000) and the Tianjin Natural Science Foundation (No. 16JCYBJC42300).

Author information

Corresponding author

Correspondence to Yukuan Sun.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Wei, S., Wang, J., Sun, Y., Jin, G., Liang, J., Liu, K. (2019). Scenario Referring Expression Comprehension via Attributes of Vision and Language. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science, vol 11859. Springer, Cham. https://doi.org/10.1007/978-3-030-31726-3_37

  • DOI: https://doi.org/10.1007/978-3-030-31726-3_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-31725-6

  • Online ISBN: 978-3-030-31726-3

  • eBook Packages: Computer Science, Computer Science (R0)
