Abstract
Referring Expression Comprehension (REC) is the task of locating a particular object within an image from a natural language expression. Previous studies have assumed a one-to-one correspondence between the expression and the image: the target region that the expression refers to must exist in the current image, so the region with the highest score is returned whether or not it actually matches the expression. In practical applications, however, the referred target region must be located within a sequence of scene images that may be matched, semi-matched, or mismatched with the expression. We refer to this 3D version of the challenge as Scenario Referring Expression Comprehension (SREC). To accomplish this task, we build a test set by enhancing an existing real-scenario dataset, construct, for the first time, a Dual Attributes Recursive Retrieve Reasoning Model (DA3R) that uses the attributes of both images and expressions, and verify the feasibility of the method on the test set by assessing it with three different types of enhanced expressions.
The first author of this paper is a student. This work was supported by the National Natural Science Foundation of China (No. 61373104, No. 61405143); the Excellent Science and Technology Enterprise Specialist Project of Tianjin (No. 18JCTPJC59000) and the Tianjin Natural Science Foundation (No. 16JCYBJC42300).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Wei, S., Wang, J., Sun, Y., Jin, G., Liang, J., Liu, K. (2019). Scenario Referring Expression Comprehension via Attributes of Vision and Language. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2019. Lecture Notes in Computer Science, vol 11859. Springer, Cham. https://doi.org/10.1007/978-3-030-31726-3_37
DOI: https://doi.org/10.1007/978-3-030-31726-3_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-31725-6
Online ISBN: 978-3-030-31726-3
eBook Packages: Computer Science, Computer Science (R0)