Abstract
One of the most basic functions of language is to refer to objects in a shared scene. Modeling reference with continuous representations is challenging because it requires individuation, i.e., tracking and distinguishing an arbitrary number of referents. We introduce a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object if the expression has a unique referent, or indicates a failure, if it does not. The model, directly trained on reference acts, is competitive with a pipeline manually engineered to perform the same task, both when referents are purely visual, and when they are characterized by a combination of visual and linguistic properties.
Notes
- 1. We ignore the thorny philosophical issues of reference, such as its relationship to reality. For an overview and references (no pun intended), see [3].
- 2. For neural network design and training see, e.g., [6].
- 3. We do not enter the determiner in the query, since it does not vary across data points: our setup is equivalent to always having “the” in the input. The network learns the intended semantics through training.
- 4.
- 5. We use the MatConvNet toolkit, http://www.vlfeat.org/matconvnet/.
References
1. Russell, B.: On denoting. Mind 14, 479–493 (1905)
2. Harnad, S.: The symbol grounding problem. Physica D 42, 335–346 (1990)
3. Reimer, M., Michaelson, E.: Reference. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Winter 2014 edn. (2014)
4. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)
5. Bos, J., Clark, S., Steedman, M., Curran, J.R., Hockenmaier, J.: Wide-coverage semantic representations from a CCG parser. In: Proceedings of COLING, Geneva, Switzerland, pp. 1240–1246 (2004)
6. Nielsen, M.: Neural Networks and Deep Learning. Determination Press, New York (2015). http://neuralnetworksanddeeplearning.com/
7. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS, Lake Tahoe, NV, pp. 2121–2129 (2013)
8. Lazaridou, A., Dinu, G., Baroni, M.: Hubness and pollution: delving into cross-space mapping for zero-shot learning. In: Proceedings of ACL, Beijing, China, pp. 270–280 (2015)
9. Weston, J., Bengio, S., Usunier, N.: WSABIE: scaling up to large vocabulary image annotation. In: Proceedings of IJCAI, Barcelona, Spain, pp. 2764–2770 (2011)
10. Lazaridou, A., Pham, N., Baroni, M.: Combining language and vision with a multimodal skip-gram model. In: Proceedings of NAACL, Denver, CO, pp. 153–163 (2015)
11. Baroni, M., Lenci, A.: Distributional memory: a general framework for corpus-based semantics. Comput. Linguist. 36, 673–721 (2010)
12. Brysbaert, M., Warriner, A.B., Kuperman, V.: Concreteness ratings for 40 thousand generally known English word lemmas. Behav. Res. Methods 46, 904–911 (2014)
13. Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of ACL, Baltimore, MD, pp. 238–247 (2014)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main
15. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main
16. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of ICML, Lille, France, pp. 2048–2057 (2015)
17. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Proceedings of NIPS, Montreal, Canada, pp. 2692–2700 (2015)
18. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-end memory networks (2015). http://arxiv.org/abs/1503.08895
19. Weston, J., Chopra, S., Bordes, A.: Memory networks. In: Proceedings of ICLR Conference Track, San Diego, CA (2015). http://www.iclr.cc/doku.php?id=iclr2015:main
20. Gorniak, P., Roy, D.: Grounded semantic composition for visual scenes. J. Artif. Intell. Res. 21, 429–470 (2004)
21. Larsson, S.: Formal semantics for perceptual classification. J. Logic Comput. 25, 335–369 (2015)
22. Matuszek, C., Bo, L., Zettlemoyer, L., Fox, D.: Learning from unscripted deictic gesture and language for human-robot interactions. In: Proceedings of AAAI, Quebec City, Canada, pp. 2556–2563 (2014)
23. Steels, L., Belpaeme, T.: Coordinating perceptually grounded categories through language: a case study for colour. Behav. Brain Sci. 28, 469–529 (2005)
24. Kennington, C., Schlangen, D.: Simple learning and compositional application of perceptually grounded word meanings for incremental reference resolution. In: Proceedings of ACL, Beijing, China, pp. 292–301 (2015)
25. Krahmer, E., van Deemter, K.: Computational generation of referring expressions: a survey. Comput. Linguist. 38 (2012)
26. Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: referring to objects in photographs of natural scenes. In: Proceedings of EMNLP, Doha, Qatar, pp. 787–798 (2014)
27. Tily, H., Piantadosi, S.: Refer efficiently: use less informative expressions for more predictable meanings. In: Proceedings of the CogSci Workshop on the Production of Referring Expressions, Amsterdam, The Netherlands (2009)
28. Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of CVPR, Las Vegas, NV (2016, in press)
29. Geman, D., Geman, S., Hallonquist, N., Younes, L.: Visual Turing test for computer vision systems. Proc. Nat. Acad. Sci. 112, 3618–3623 (2015)
30. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Proceedings of NIPS, Montreal, Canada, pp. 1682–1690 (2014)
31. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question answering. In: Proceedings of NIPS, Montreal, Canada (2015). https://papers.nips.cc/book/advances-in-neural-information-processing-systems-28-2015
32. Baroni, M.: Grounding distributional semantics in the visual world. Lang. Linguist. Compass 10, 3–13 (2016)
33. Abbott, B.: Reference. Oxford University Press, Oxford (2010)
34. Datta, R., Joshi, D., Li, J., Wang, J.: Image retrieval: ideas, influences, and trends of the new age. ACM Comput. Surv. 40, 1–60 (2008)
Acknowledgments
We are grateful to Elia Bruni for the CNN baseline idea, and to Angeliki Lazaridou for providing us with the visual vectors used in the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655577 (LOVe) and ERC grant agreement No 715154 (AMORE); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); DFG (SFB 732, Project D10); and Spanish MINECO (grant FFI2013-41301-P). This paper reflects the authors’ view only, and the EU is not responsible for any use that may be made of the information it contains.
Appendices
A Data Creation for the Object-Only Dataset (Experiment 1)
The process for generating an object sequence is shown in Algorithm 1. We start with an empty sequence and sample its length uniformly at random from the permitted sequence lengths (l. 2). We fill the sequence with objects and images sampled uniformly at random (l. 4/5). We assume, without loss of generality, that the object we will query for, q, is the first one (l. 6). Then we sample whether the current sequence should be an anomaly (l. 7). If it should be a missing-anomaly (i.e., no matches for the query), we overwrite the target object and image with a new random draw from the pool (l. 9/10). If we decide to turn it into a multiple-anomaly (i.e., with multiple matches for the query), we randomly select another position in the sequence and overwrite it with the query object and a new image (l. 12/13). Finally, we shuffle the sequence so that the query is assigned a random position (l. 14).
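For concreteness, the following Python sketch mirrors the steps described above. It is a minimal sketch, not the authors' code: the pool interface (sample_object, sample_image, the exclude argument) and the anomaly probabilities p_missing and p_multiple are hypothetical stand-ins.

```python
import random

def generate_object_only_sequence(pool, lengths, p_missing, p_multiple):
    """Sample one Object-Only data point, mirroring the steps of Algorithm 1."""
    length = random.choice(lengths)                    # l. 2: sample sequence length
    seq = []
    for _ in range(length):                            # l. 3-5: fill with random (object, image) pairs
        obj = pool.sample_object()
        seq.append((obj, pool.sample_image(obj)))
    query = seq[0][0]                                  # l. 6: w.l.o.g., the first object is the query
    r = random.random()                                # l. 7: decide whether to create an anomaly
    if r < p_missing:
        # l. 8-10: missing-anomaly -- overwrite the target so no object matches the query
        # (assumption: the replacement object must differ from the query)
        new_obj = pool.sample_object(exclude=query)
        seq[0] = (new_obj, pool.sample_image(new_obj))
    elif r < p_missing + p_multiple:
        # l. 11-13: multiple-anomaly -- duplicate the query object at another position
        j = random.randrange(1, length)
        seq[j] = (query, pool.sample_image(query))
    random.shuffle(seq)                                # l. 14: assign the query a random position
    return query, seq
```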

B Data Creation for the Object+Attribute Dataset (Experiment 2)
Figure 5 shows the intuition for sampling the Object+Attribute dataset. Arrows indicate compatibility constraints in sampling. We start from the query pair (object 1 – attribute 1). Then we sample two more attributes that are both compatible with object 1. Finally, we sample two more objects, each compatible both with the original attribute 1 and with one of the two new attributes.
Algorithm 2 defines the sampling procedure formally. We sample the first triple randomly (l. 2). Then we sample two compatible attributes for this object (l. 3), and one more object for each attribute (l. 4). This yields a set of six confounders (l. 5–10). After sampling the length of the final sequence l (l. 11), we build the sequence from the first triple and \(l-1\) confounders (l. 12–13), with the first triple as query (l. 14). The treatment of the anomalies is exactly as before.
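As before, here is a rough Python sketch of this procedure. The pool interface (sample_triple, compatible_attributes, compatible_object, sample_image) is hypothetical, and the specific set of six confounders reflects one reading of the compatibility constraints in Fig. 5, not necessarily the authors' exact construction.

```python
import random

def generate_object_attribute_sequence(pool, lengths):
    """Sample one Object+Attribute data point, mirroring the steps of Algorithm 2."""
    obj1, attr1, img1 = pool.sample_triple()                   # l. 2: random first triple
    attr2, attr3 = pool.compatible_attributes(obj1, n=2)       # l. 3: two attributes compatible with obj1
    obj2 = pool.compatible_object(attr1, attr2)                # l. 4: one further object per new attribute,
    obj3 = pool.compatible_object(attr1, attr3)                #       each also compatible with attr1
    confounder_pairs = [                                       # l. 5-10: six confounding object-attribute pairs
        (obj1, attr2), (obj1, attr3),
        (obj2, attr1), (obj2, attr2),
        (obj3, attr1), (obj3, attr3),
    ]
    length = random.choice(lengths)                            # l. 11: sample the final sequence length
    chosen = random.sample(confounder_pairs, length - 1)       # assumes lengths <= 7 (drawn without replacement)
    seq = [(obj1, attr1, img1)] + \
          [(o, a, pool.sample_image(o)) for o, a in chosen]    # l. 12-13: first triple plus l-1 confounders
    query = (obj1, attr1)                                      # l. 14: the first triple is the query
    # Anomalies (missing / multiple matches) are then introduced exactly as in Algorithm 1.
    return query, seq
```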

C Statistics on the Datasets
Table 2 shows statistics on the datasets. The first line covers the Object-Only dataset. Objects occur on average 90 times in the train portion of Object-Only, while specific images occur only twice on average; the numbers for the test set are commensurately lower. While all objects in the test set are seen during training, 23% of the images are not. Because the data are created by random sampling, only a minimal number of sequences are repeated (5 sequences occur twice in the training set, 1 occurs four times) or shared between the training and validation sets (1 sequence). All other sequences occur just once.
The second line covers the Object+Attribute dataset. The average frequencies for objects and object images mirror those in Object-Only quite closely. The new columns on object-attribute (O+A) and object-attribute-image (O+A+I) combinations show that object-attribute combinations occur relatively infrequently (each object is paired with many attributes), but that the space of combinations is considerably restricted (almost no combinations are new in the test set). The full entity representations (object-attribute-image triples), however, are very infrequent (average frequency just above 1), and more than 80% of the triples in the test set are unseen during training. A single sequence occurs twice in the test set, all others once; one sequence is shared between train and test.
D Hyperparameter Tuning
We tuned the following hyperparameters on the Object-Only validation set and re-used them for Object+Attribute without further tuning (except for the Pipeline heuristics’ thresholds). Chosen values are given in parentheses.
- PoP: multimodal embedding size (300), anomaly sensor size (100), nonlinearities \(\psi \) (relu) and \(\phi \) (sigmoid), learning rate (0.09), epoch count (14).
- TRPoP: same settings, except epoch count (36).
- Pipeline: multimodal embedding size (300), margin size (0.5), learning rate (0.09), maximum similarity threshold (0.1 for Object-Only, 0.4 for Object+Attribute), top-two similarity difference threshold (0.05 and 0.07); see the sketch at the end of this section for one way these thresholds can be applied.
Momentum was set to 0.09, learning rate decay to 1E-4 for all models, based on informal preliminary experimentation.
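To make the role of the two Pipeline thresholds concrete, here is a minimal sketch of one way they could be applied at test time; the decision rule, function name, and similarity input are assumptions for illustration, not necessarily the authors' exact implementation.

```python
import numpy as np

def pipeline_decision(similarities, max_sim_threshold, top_two_diff_threshold):
    """Decide between pointing, 'missing', and 'multiple' from query-object similarities."""
    order = np.argsort(similarities)[::-1]        # indices sorted by decreasing similarity
    best, runner_up = order[0], order[1]
    if similarities[best] < max_sim_threshold:
        return "missing"                          # nothing matches the query well enough
    if similarities[best] - similarities[runner_up] < top_two_diff_threshold:
        return "multiple"                         # two near-equally good candidates
    return int(best)                              # point to the best-matching object
```

With the Object-Only values above, for instance, pipeline_decision(sims, 0.1, 0.05) returns either the index of the chosen object or one of the two anomaly labels.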