
Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics

Authors:

Peratham Wiriyathammabhum,

Douglas Summers-Stay,

Cornelia Fermüller,

Yiannis Aloimonos

ACM Computing Surveys (CSUR), Volume 49, Issue 4

Article No.: 71, Pages 1 - 44

https://doi.org/10.1145/3009906

Published: 12 December 2016

Abstract

Integrating computer vision and natural language processing is a novel interdisciplinary field that has received a lot of attention recently. In this survey, we provide a comprehensive introduction to the integration of computer vision and natural language processing in multimedia and robotics applications, with more than 200 key references. The tasks that we survey include visual attributes, image captioning, video captioning, visual question answering, visual retrieval, human-robot interaction, robotic actions, and robot navigation. We also emphasize strategies for integrating computer vision and natural language processing models under the unifying theme of distributional semantics. We draw an analogy between distributional semantics in computer vision and in natural language processing, in the form of image embeddings and word embeddings, respectively. We also present a unified view of the field and propose possible future directions.

References

[1]

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292 (2015).

[2]

Eren Erdal Aksoy, Alexey Abramov, Johannes Dörr, Kejun Ning, Babette Dellen, and Florentin Wörgötter. 2011. Learning the semantics of object--action relations by observation. Int. J. Robot. Res. (2011), 0278364911410459.

Digital Library

[3]

Yiannis Aloimonos and Cornelia Fermüller. 2015. The cognitive dialogue: A new model for vision implementing common sense reasoning. Image Vis. Comput. 34 (2015), 42--44.

Digital Library

[4]

Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. 2014. Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15, 1 (2014), 2773--2832.

Digital Library

[5]

Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. 2012a. A method of moments for mixture models and hidden Markov models. In COLT, Vol. 1. 4.

[6]

Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, and Sham M. Kakade. 2012b. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems. 917--925.

Digital Library

[7]

Andrew J. Anderson, Elia Bruni, Ulisse Bordignon, Massimo Poesio, and Marco Baroni. 2013. Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In EMNLP. 1960--1970.

[8]

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016a. Learning to compose neural networks for question answering. In NAACL.

[9]

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016b. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 39--48.

[10]

Mark Andrews, Gabriella Vigliocco, and David Vinson. 2009. Integrating experiential and distributional data to learn semantic representations. Psychol. Rev. 116, 3 (2009), 463.

[11]

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425--2433.

Digital Library

[12]

Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. 2009. A survey of robot learning from demonstration. Robot. Auton. Syst. 57, 5 (2009), 469--483.

Digital Library

[13]

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. In ICLR 2015.

[14]

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics--Volume 1. Association for Computational Linguistics, 86--90.

Digital Library

[15]

Gökhan Bakir. 2007. Predicting structured data. MIT press, 2007.

Digital Library

[16]

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2012. Abstract meaning representation (AMR) 1.0 specification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. ACL. 1533--1544.

[17]

Albert Bandura. 1974. Psychological Modeling: Conflicting Theories. Transaction Publishers.

[18]

Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, and others. 2012a. Video in sentences out. In UAI 2012.

Digital Library

[19]

Andrei Barbu, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. 2012b. Simultaneous object detection, tracking, and event recognition. In ACS 2012.

[20]

Kobus Barnard, Pinar Duygulu, David Forsyth, Nando De Freitas, David M. Blei, and Michael I. Jordan. 2003. Matching words and pictures. J. Mach. Learn. Res. 3 (2003), 1107--1135.

Digital Library

[21]

Kobus Barnard and David Forsyth. 2001. Learning the semantics of words and pictures. In Proceedings of the 8th IEEE International Conference on Computer Vision, 2001 (ICCV 2001), Vol. 2. IEEE, 408--415.

[22]

Marco Baroni. 2016. Grounding distributional semantics in the visual world. Lang. Ling. Compass 10, 1 (2016), 3--13.

[23]

Francisco Barranco, Cornelia Fermüller, and Yiannis Aloimonos. 2014. Contour motion estimation for asynchronous event-driven cameras. Proc. IEEE 102, 10 (2014), 1537--1556.

[24]

Daniel Barrett, Andrei Barbu, N. Siddharth, and Jeffrey Siskind. 2016. Saying what you're looking for: Linguistics meets video search. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 10 (Oct. 2016).

Digital Library

[25]

Jonathan Barron and Jitendra Malik. 2015. Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell. 37, 8 (2015), 1670--1687.

Digital Library

[26]

Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. 2008. Speeded-up robust features (SURF). Comput. Vis. Image Understand. 110, 3 (2008), 346--359.

Digital Library

[27]

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. 2006. Surf: Speeded up robust features. In Computer Vision--ECCV 2006. Springer, 404--417.

Digital Library

[28]

Michael Beetz, Suat Gedikli, Jan Bandouch, Bernhard Kirchlechner, Nico von Hoyningen-Huene, and Alexander Perzylo. 2007. Visually tracking football games based on TV broadcasts. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI).

Digital Library

[29]

Peter N. Belhumeur, João P. Hespanha, and David J. Kriegman. 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. 19, 7 (1997), 711--720.

Digital Library

[30]

Mikhail Belkin and Partha Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neur. Comput. 15, 6 (2003), 1373--1396.

Digital Library

[31]

Islam Beltagy, Stephen Roller, Pengxiang Cheng, Katrin Erk, and Raymond J. Mooney. 2015. Representing meaning with a combination of logical form and vectors. arXiv preprint arXiv:1505.06816 (2015).

[32]

Yoshua Bengio, Aaron Courville, and Pierre Vincent. 2013. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 8 (2013), 1798--1828.

Digital Library

[33]

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3 (2003), 1137--1155.

Digital Library

[34]

Yoshua Bengio, Hugo Larochelle, Pascal Lamblin, Dan Popovici, Aaron Courville, Clarence Simard, Jerome Louradour, and Dumitru Erhan. 2007. Deep architectures for baby AI. (2007).

[35]

A. Berg, J. Deng, and L. Fei-Fei. 2010. Large scale visual recognition challenge (ILSVRC), 2010. Retrieved from http://www. image-net.org/challenges/LSVRC (2010).

[36]

Tamara Berg and Alexander C. Berg. 2009. Finding iconic images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009 (CVPR Workshops 2009). IEEE, 1--8.

[37]

Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee-Whye Teh, Erik Learned-Miller, and David A. Forsyth. 2004. Names and faces in the news. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), Vol. 2. IEEE, II--848.

Digital Library

[38]

Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. 2010. Automatic attribute discovery and characterization from noisy web data. In Computer Vision--ECCV 2010. Springer, 663--676.

Digital Library

[39]

Tamara L. Berg, David Forsyth, and others. 2006. Animals on the web. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1463--1470.

Digital Library

[40]

Raffaella Bernardi, Ruket Cakici, Desmond Elliott, Aykut Erdem, Erkut Erdem, Nazli Ikizler-Cinbis, Frank Keller, Adrian Muscat, and Barbara Plank. 2016. Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res. 55 (2016), 409--442.

[41]

David M. Blei and Michael I. Jordan. 2003. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM, 127--134.

Digital Library

[42]

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022.

[43]

Benjamin S. Bloom and others. 1956. Taxonomy of educational objectives. Vol. 1: Cognitive domain. McKay, New York, NY (1956), 20--24.

[44]

Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. 2005. Three-dimensional face recognition. Int. J. Comput. Vis. 64, 1 (2005), 5--30.

Digital Library

[45]

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 136--145.

Digital Library

[46]

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. J. Artif. Intell. Res. 49 (2014), 1--47.

Digital Library

[47]

Donna Byron, Alexander Koller, Jon Oberlander, Laura Stoia, and Kristina Striegnitz. 2007. Generating instructions in virtual environments (GIVE): A challenge and an evaluation testbed for NLG. (2007).

[48]

Angelo Cangelosi. 2006. The grounding and sharing of symbols. Pragm. Cogn. 14, 2 (2006), 275--285.

[49]

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI, Vol. 5. 3.

Digital Library

[50]

Marisa Carrasco. 2011. Visual attention: The past 25 years. Vis. Res. 51, 13 (2011), 1484--1525.

[51]

Joao Carreira and Cristian Sminchisescu. 2010. Constrained parametric min-cuts for automatic object segmentation. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3241--3248.

[52]

Angel X. Chang, Manolis Savva, and Christopher D. Manning. 2014. Semantic parsing for text to 3d scene generation. ACL 2014 (2014), 17.

[53]

Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. 2015. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision. 1017--1025.

Digital Library

[54]

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.

[55]

Anthony Chemero. 2003. An outline of a theory of affordances. Ecological Psychology 15, 2 (2003), 181--195.

[56]

David L. Chen and William B. Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, 190--200.

Digital Library

[57]

David L. Chen and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. ACM, 128--135.

Digital Library

[58]

Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015a. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015).

[59]

Xinlei Chen, Ashish Shrivastava, and Arpan Gupta. 2013. Neil: Extracting visual knowledge from web data. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 1409--1416.

Digital Library

[60]

Zhigang Chen, Wei Lin, Qian Chen, Xiaoping Chen, Si Wei, Hui Jiang, and Xiaodan Zhu. 2015b. Revisiting word embedding for contrasting meaning. In Proceedings of ACL.

[61]

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder--decoder approaches. Syntax Sem. Struct. Stat. Transl. (2014), 103.

[62]

Myung Jin Choi, Antonio Torralba, and Alan S. Willsky. 2012. Context models and out-of-context objects. Pattern Recogn. Lett. 33, 7 (2012), 853--862.

Digital Library

[63]

Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. 2009. NUS-WIDE: A real-world web image database from national university of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval. ACM, 48.

Digital Library

[64]

Stephen Clark and Stephen Pulman. 2007. Combining symbolic and distributional models of meaning. In AAAI Spring Symposium: Quantum Interaction. 52--55.

[65]

Michael D. Cohen and Paul Bacdayan. 1994. Organizational routines are stored as procedural memory: Evidence from a laboratory study. Organiz. Sci. 5, 4 (1994), 554--568.

Digital Library

[66]

Nadav Cohen, Or Sharir, and Amnon Shashua. 2016. On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory. 698--728.

[67]

Silvia Coradeschi, Amy Loutfi, and Britta Wrede. 2013. A short review of symbol grounding in robotic and intelligent systems. KI-Künstliche Intell. 27, 2 (2013), 129--136.

[68]

Silvia Coradeschi and Alessandro Saffiotti. 2000. Anchoring symbols to sensor data: Preliminary report. In AAAI/IAAI. 129--135.

Digital Library

[69]

Nelson Cowan. 2008. What are the differences between long-term, short-term, and working memory? Progr. Brain Res. 169 (2008), 323--338.

[70]

Trevor Darrell. 2010. Learning Representations for Real-world Recognition. Retrieved from http://www.eecs.berkeley.edu/&sim;trevor/colloq.pdf UCB EECS Colloquium {Accessed: 2015 11 1}.

[71]

Pradipto Das, Chenliang Xu, Richard Doell, and Jason Corso. 2013. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2634--2641.

Digital Library

[72]

Hal Daumé III. 2007. Frustratingly easy domain adaptation. ACL 2007 (2007), 256.

[73]

Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Mach. Learn. 75, 3 (2009), 297--325.

Digital Library

[74]

Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269--1277.

Digital Library

[75]

Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Karl Stratos, Kota Yamaguchi, Yejin Choi, Hal Daumé III, Alexander C. Berg, and others. 2012. Detecting visual text. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 762--772.

Digital Library

[76]

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625--2634.

[77]

Alexey Dosovitskiy, Philipp Fischery, Eddy Ilg, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox, and others. 2015. Flownet: Learning optical flow with convolutional networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2758--2766.

Digital Library

[78]

Susan T. Dumais. 2007. LSA and information retrieval: Getting back to basics. Handb. Latent Semant. Anal. (2007), 293--321.

[79]

Hugh Durrant-Whyte and Tim Bailey. 2006. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 13, 2 (2006), 99--110.

[80]

Pinar Duygulu, Kobus Barnard, Joao F. G. de Freitas, and David A. Forsyth. 2002. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Computer Vision ECCV 2002. Springer, 97--112.

Digital Library

[81]

Aleksandrs Ecins, Cornelia Fermuller, and Yiannis Aloimonos. 2014. Shadow free segmentation in still images using local density measure. In Proceedings of the 2014 IEEE International Conference on Computational Photography (ICCP). IEEE, 1--8.

[82]

Aleksandrs Ecins, Cornelia Fermuller, and Yiannis Aloimonos. 2016. Cluttered scene segmentation using the symmetry constraint. In Proceedings of the International Conference in Robotics and Automation (ICRA).

Digital Library

[83]

H. Eichenbaum. 2008. Memory. Scholarpedia 3, 3 (2008), 1747.

[84]

Desmond Elliott and Frank Keller. 2013. Image description using visual dependency representations. In EMNLP. 1292--1302.

[85]

Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Short Papers, Vol. 452. 457.

[86]

Oren Etzioni, Michele Banko, and Michael J. Cafarella. 2006. Machine reading. In AAAI, Vol. 6. 1517--1519.

Digital Library

[87]

Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303--338.

Digital Library

[88]

Rui Fang, Changsong Liu, Lanbo She, and Joyce Y. Chai. 2013. Towards situated dialogue: Revisiting referring expression generation. In EMNLP. 392--402.

[89]

Ali Farhadi. 2011. Designing Representational Architectures in Recognition. University of Illinois at Urbana-Champaign. Champaign, IL, USA.

[90]

Ali Farhadi, Ian Endres, and Derek Hoiem. 2010. Attribute-centric recognition for cross-category generalization. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2352--2359.

[91]

Alireza Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. 2009. Describing objects by their attributes. In IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009). IEEE, 1778--1785.

[92]

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. 2010. Every picture tells a story: Generating sentences from images. In Computer Vision--ECCV 2010. Springer, 15--29.

Digital Library

[93]

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2014. Retrofitting word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166 (2014).

[94]

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004 (2015).

[95]

Li Fei-Fei, Rob Fergus, and Pietro Perona. 2007. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Comput. Vis. Image Underst. 106, 1 (2007), 59--70.

Digital Library

[96]

S. L. Feng, Raghavan Manmatha, and Victor Lavrenko. 2004. Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004 (CVPR’04)., Vol. 2. IEEE, II--1002.

Digital Library

[97]

Francis Ferraro, Nasrin Mostafazadeh, Ting-Hao Huang, Lucy Vanderwende, Jacob Devlin, Michel Galley, and Margaret Mitchell. 2015. A survey of current datasets for vision and language research. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, 207--213.

[98]

Ronald A. Fisher. 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 2 (1936), 179--188.

[99]

Daryl Fougnie. 2008. The relationship between attention and working memory. New Res. Short-term Mem. (2008), 1--45.

[100]

D. F. Fouhey, A. Gupta, and A. Zisserman. 2016. 3D shape attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[101]

Jianlong Fu, Jinqiao Wang, Xin-Jing Wang, Yong Rui, and Hanqing Lu. 2015. What visual attributes characterize an object class? In Computer Vision--ACCV 2014. Springer, 243--259.

[102]

Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. 2015. Are you talking to a machine? Dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems. 2296--2304.

Digital Library

[103]

D. Garcia-Gasulla, J. Béjar, U. Cortés, E. Ayguadé, and J. Labarta. 2015. Extracting visual patterns from deep learning representations. arXiv preprint arXiv:1507.08818 (2015).

[104]

Peter Gärdenfors. 2014. The Geometry of Meaning: Semantics Based on Conceptual Spaces. MIT Press.

[105]

Konstantina Garoufi. 2014. Planning-based models of natural language generation. Lang. Ling. Compass 8, 1 (2014), 1--10.

[106]

Konstantina Garoufi and Alexander Koller. 2010. Automated planning for situated natural language generation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1573--1582.

Digital Library

[107]

Konstantina Garoufi, Maria Staudte, Alexander Koller, and Matthew W. Crocker. 2016. Exploiting listener gaze to improve situated communication in dynamic virtual environments. Cognitive Science 40, 7 (2016), 1671--1703.

[108]

Dan Garrette, Katrin Erk, and Raymond Mooney. 2014. A formal approach to linking logical form and vector-space lexical semantics. In Computing Meaning. Springer, 27--48.

[109]

Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision. 1440--1448.

Digital Library

[110]

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 580--587.

Digital Library

[111]

Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis. 106, 2 (2014), 210--233.

Digital Library

[112]

Kristen Grauman and Bastian Leibe. 2010. Visual Object Recognition. Number 11. Morgan 8 Claypool Publishers.

Digital Library

[113]

Douglas Greenlee. 1978. Semiotic and significs. Int. Stud. Philos. 10 (1978), 251--254.

[114]

Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Sarad Venugopalan, Randy Mooney, Trevor Darrell, and Kate Saenko. 2013. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 2712--2719.

Digital Library

[115]

Gutemberg Guerra-Filho and Yiannis Aloimonos. 2007. A language for human action. Computer 40, 5 (2007), 42--51.

Digital Library

[116]

Abhinav Gupta. 2009. Beyond nouns and verbs. (2009).

[117]

Saurabh Gupta, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2015. Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation. Int. J. Comput. Vis. 112, 2 (2015), 133--149.

Digital Library

[118]

Saurabh Gupta and Jitendra Malik. 2015. Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015).

[119]

Xintong Han, Bharat Singh, Vlad I. Morariu, and Larry S. Davis. 2015. Fast automatic video retrieval using web images. arXiv preprint arXiv:1512.03384 (2015).

[120]

Emily M. Hand and Rama Chellappa. 2016. Attributes for improved attributes: A multi-task network for attribute classification. arXiv preprint arXiv:1604.07360 (2016).

[121]

Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. 2014. Simultaneous detection and segmentation. In Proceedings of the European Conference on Computer Vision (ECCV).

[122]

Stevan Harnad. 1990. The symbol grounding problem. Physica D 42, 1 (1990), 335--346.

Digital Library

[123]

Zellig S. Harris. 1954. Distributional structure. Word 10, 2--3 (1954), 146--162.

[124]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.

[125]

Geremy Heitz and Daphne Koller. 2008. Learning spatial context: Using stuff to find things. In Computer Vision--ECCV 2008. Springer, 30--43.

Digital Library

[126]

Geoffrey E. Hinton. 1984. Distributed representations. Technical Report: Carnegie Melon University.

[127]

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neur. Comput. 9, 8 (1997), 1735--1780.

Digital Library

[128]

Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. (2013), 853--899.

Digital Library

[129]

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 50--57.

Digital Library

[130]

Bernhard Hommel, Jochen Müsseler, Gisa Aschersleben, and Wolfgang Prinzb. 2001. The theory of event coding (TEC): A framework for perception and action planning. Behav. Brain Sci. 24 (2001), 849--937.

[131]

Thanarat Horprasert, David Harwood, and Larry S. Davis. 1999. A statistical approach for real-time robust background subtraction and shadow detection. In IEEE ICCV, Vol. 99. 1--19.

[132]

Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. 2007. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. Technical Report. Technical Report 07-49, University of Massachusetts, Amherst.

[133]

Mark J. Huiskes and Michael S. Lew. 2008. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval. ACM, 39--43.

Digital Library

[134]

Julian Jaynes. 2000. The Origin of Consciousness in the Breakdown of the Bicameral Mind. Houghton Mifflin Harcourt.

[135]

Jiwoon Jeon, Victor Lavrenko, and Raghavan Manmatha. 2003. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 119--126.

Digital Library

[136]

Justin Johnson, Andrej Karpathy, and Li Fei-Fei. 2016. DenseCap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[137]

Benjamin Johnston, Fangkai Yang, Rogan Mendoza, Xiaoping Chen, and Mary-Anne Williams. 2008. Ontology based object categorization for robots. In Practical Aspects of Knowledge Management. Springer, 219--231.

Digital Library

[138]

Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2015. Learning visual features from large weakly supervised data. arXiv preprint arXiv:1511.02251 (2015).

[139]

Alap Karapurkar. 2008. Modeling human activities. Scholarly Paper Archive, Department of Computer Science, University of Maryland, College Park, MD, 20742.

[140]

Andrej Karpathy and Li Fei-Fei. 2015a. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3128--3137.

[141]

Andrej Karpathy and Li Fei-Fei. 2015b. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[142]

Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3276--3284.

Digital Library

[143]

Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. Int. J. Comput. Vis. 50, 2 (2002), 171--184.

Digital Library

[144]

Alexander Koller and Matthew Stone. 2007. Sentence generation as a planning problem. ACL 2007 (2007), 336.

[145]

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalanditis, Li-Jia Li, David A. Shamma, Michael Bernstein, and Li Fei-Fei. 2016. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2016), 45.

[146]

Niveda Krishnamoorthy, Girish Malkarnenkar, Raymond Mooney, Kate Saenko, and Sergio Guadarrama. 2013. Generating natural-language video descriptions using text-mined knowledge. NAACL HLT 2013 (2013), 10.

[147]

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097--1105.

Digital Library

[148]

German Kruszewski, Denis Paperno, and Marco Baroni. 2015. Deriving boolean structures from distributional vectors. Trans. Assoc. Comput. Ling. 3 (2015), 375--388.

[149]

Gaurav Kulkarni, Visruth Premraj, Vicente Ordonez, Sudipta Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35, 12 (2013), 2891--2903.

Digital Library

[150]

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.

[151]

Neeraj Kumar, Alexander C. Berg, Peter N. Belhumeur, and Shree K. Nayar. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision. IEEE, 365--372.

[152]

Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, and Yejin Choi. 2012. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 359--368.

[153]

Kevin Lai, Liefeng Bo, Xiaofeng Ren, and Dieter Fox. 2011. A large-scale hierarchical multi-view rgb-d object dataset. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1817--1824.

[154]

Kevin Lai and Dieter Fox. 2010. Object recognition in 3D point clouds using web data and domain adaptation. Int. J. Robot. Res. 29, 8 (2010), 1019--1037.

[155]

Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. 2009. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009 (CVPR 2009). IEEE, 951--958.

[156]

Victor Lavrenko, R. Manmatha, and Jiwoon Jeon. 2003. A model for learning the semantics of pictures. In Advances in Neural Information Processing Systems.

[157]

Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. 2006. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 2169--2178.

[158]

Dieu-Thu Le, Jasper Uijlings, and Raffaella Bernardi. 2014. Tuhoi: Trento universal human object interaction dataset. V&L Net 2014 (2014), 17.

[159]

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278--2324.

[160]

Chee Wee Leong and Rada Mihalcea. 2011. Going beyond text: A hybrid image-text approach for measuring word relatedness. In IJCNLP. 1403--1407.

[161]

Stephen C. Levinson. 2001. Pragmatics. In International Encyclopedia of Social and Behavioral Sciences: Vol. 17. Pergamon, 11948--11954.

[162]

Omer Levy and Yoav Goldberg. 2014a. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 2. 302--308.

[163]

Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems. 2177--2185.

[164]

Li-Jia Li and Li Fei-Fei. 2007. What, where and who? Classifying events by scene and object recognition. In Proceedings of the IEEE 11th International Conference on Computer Vision. IEEE, 1--8.

[165]

Siming Li, Girish Kulkarni, Tamara L. Berg, Alexander C. Berg, and Yejin Choi. 2011. Composing simple image descriptions using web-scale n-grams. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 220--228.

[166]

Xirong Li, Tiberio Uricchio, Lamberto Ballan, Marco Bertini, Cees G. M. Snoek, and Alberto Del Bimbo. 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput. Surv. 49, 1 (2016), 14.

[167]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Vol. 8.

[168]

Changsong Liu and Joyce Yue Chai. 2015. Learning to mediate perceptual differences in situated human-robot dialogue. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. AAAI Press, 2288--2294.

[169]

Changsong Liu, Lanbo She, Rui Fang, and Joyce Y. Chai. 2014a. Probabilistic labeling for efficient referential grounding based on collaborative discourse. In ACL (2). 13--18.

[170]

Jingen Liu, Benjamin Kuipers, and Silvio Savarese. 2011. Recognizing human actions by attributes. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3337--3344.

[171]

Tie-Yan Liu. 2009. Learning to rank for information retrieval. Found. Trends Inform. Retriev. 3, 3 (2009), 225--331.

[172]

Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. 2014b. Single-view 3d scene parsing by attributed grammar. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 684--691.

[173]

Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431--3440.

[174]

David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 2 (2004), 91--110.

[175]

James MacQueen and others. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. 281--297.

[176]

Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar. 2008. A new baseline for image annotation. In Computer Vision--ECCV 2008. Springer, 316--329.

[177]

Alexis Maldonado, Humberto Alvarez, and Michael Beetz. 2012. Improving robot manipulation through fingertip perception. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2947--2954.

[178]

Jitendra Malik, Pablo Arbeláez, João Carreira, Katerina Fragkiadaki, Ross Girshick, Georgia Gkioxari, Saurabh Gupta, Bharath Hariharan, Abhishek Kar, and Shubham Tulsiani. 2016. The three Rs of computer vision: Recognition, reconstruction and reorganization. Pattern Recogn. Lett. 72 (2016), 4--14.

[179]

Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision. 1--9.

[180]

Jonathan Malmaud, Jonathan Huang, Vivek Rathod, Nick Johnston, Andrew Rabinovich, and Kevin Murphy. 2015. What’s cookin’? Interpreting cooking videos using text, speech and vision. In NAACL 2015.

[181]

Matthew Marge, Claire Bonial, Brendan Byrne, Taylor Cassidy, A. William Evans, Susan G. Hill, and Clare Voss. 2016. Applying the wizard-of-oz technique to multimodal human-robot dialogue. In Proceedings of RO-MAN (To appear).

[182]

David Marr. 1982. Vision: A Computational Investigation Into the Human Representation and Processing of Visual Information. Henry Holt and Co., Inc., New York, NY.

[183]

David R. Martin, Charless C. Fowlkes, and Jitendra Malik. 2004. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Trans. Pattern Anal. Mach. Intell. 26, 5 (2004), 530--549.

[184]

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In Proceedings of the 2012 International Conference on Machine Learning. Edinburgh, Scotland.

[185]

Cynthia Matuszek, Evan Herbst, Luke Zettlemoyer, and Dieter Fox. 2013. Learning to parse natural language commands to a robot control system. In Experimental Robotics. Springer, 403--415.

[186]

Nikolaos Mavridis. 2015. A review of verbal and non-verbal human--robot interactive communication. Robot. Auton. Syst. 63 (2015), 22--35.

[187]

Nikolaos Mavridis and Deb Roy. 2006. Grounded situation models for robots: Where words and percepts meet. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 4690--4697.

[188]

Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015a. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785--794.

[189]

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015b. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43--52.

[190]

Jon D. Mcauliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems. 121--128.

[191]

Brian McMahan and Matthew Stone. 2015. A Bayesian model of grounded color semantics. Trans. Assoc. Comput. Ling. 3 (2015), 103--115.

[192]

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behav. Res. Methods 37, 4 (2005), 547--559.

[193]

Chet Meyers and Thomas B. Jones. 1993. Promoting Active Learning. Strategies for the College Classroom. ERIC.

[194]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111--3119.

[195]

George A. Miller. 1995. WordNet: A lexical database for English. Commun. ACM 38, 11 (1995), 39--41.

[196]

Marvin Minsky. 2006. The emotion machine. New York: Pantheon (2006).

[197]

Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. 2012. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 747--756.

[198]

Saif M. Mohammad, Bonnie J. Dorr, Graeme Hirst, and Peter D. Turney. 2013. Computing lexical contrast. Comput. Ling. 39, 3 (2013), 555--590.

[199]

Raymond J. Mooney. 2008. Learning to connect language and perception. In AAAI. 1598--1601.

[200]

Raymond J. Mooney. 2013. Grounded Language Learning. (July 2013). 27th AAAI Conference on Artificial Intelligence, Washington 2013. Retrieved November 2, 2015 from http://videolectures.net/aaai2013_mooney_language_learning/.

[201]

Yasuhide Mori, Hironobu Takahashi, and Ryuichi Oka. 1999. Image-to-word transformation based on dividing and vector quantizing images with words. In First International Workshop on Multimedia Intelligent Storage and Retrieval Management. Citeseer, 1--9.

[202]

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics. Citeseer, 246--252.

[203]

Charles William Morris. 1938. Foundations of the theory of signs. (1938).

[204]

Venkatesh N. Murthy, Subhransu Maji, and R. Manmatha. 2015. Automatic image annotation using deep learning representations. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 603--606.

[205]

Austin Myers, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Affordance detection of tool parts from geometric features. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).

[206]

Douglas L. Nelson, Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behav. Res. Methods Instrum. Comput. 36, 3 (2004), 402--407.

[207]

Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry Vetrov. 2015. Tensorizing neural networks. In Advances in Neural Information Processing Systems 28 (NIPS).

[208]

Aude Oliva and Antonio Torralba. 2001. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 3 (2001), 145--175.

[209]

Vicente Ordonez, Jia Deng, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. 2013. From large scale image categorization to entry-level categories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).

[210]

Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2text: Describing images using 1 million captioned photographs. In Advances in Neural Information Processing Systems. 1143--1151.

[211]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 311--318.

[212]

Devi Parikh. 2009. Modeling context for image understanding: When, for what, and how? (2009).

[213]

Devi Parikh and Kristen Grauman. 2011. Relative attributes. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 503--510.

[214]

Seyoung Park, Bruce Xiaohan Nie, and Song-Chun Zhu. 2016. Attribute and-or grammar for joint parsing of human attributes, part and pose. arXiv preprint arXiv:1605.02112 (2016).

[215]

Katerina Pastra and Yiannis Aloimonos. 2012. The minimalist grammar of action. Philos. Trans. Roy. Soc. B: Biol. Sci. 367, 1585 (2012), 103--117.

[216]

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12 (2014), 1532--1543.

[217]

Jean Piaget. 2013. Play, Dreams and Imitation in Childhood. Vol. 25. Routledge.

[218]

Tony Plate. 1997. A common framework for distributed representation schemes for compositional structure. Connectionist Systems for Knowledge Representation and Deduction (1997), 15--34.

[219]

Robert Pless and Richard Souvenir. 2009. A survey of manifold learning for images. IPSJ Trans. Comput. Vis. Appl. 1 (2009), 83--94.

[220]

J. Pont-Tuset, P. Arbelaez, J. Barron, F. Marques, and J. Malik. 2016. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Trans. Pattern Anal. Mach. Intell. (2016).

[221]

Hoifung Poon and Pedro Domingos. 2009. Unsupervised semantic parsing. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. Association for Computational Linguistics, 1--10.

[222]

Cecilia Quiroga-Clare. 2003. Language ambiguity: A curse and a blessing. Transl. J. 7, 1 (2003).

[223]

Gabriel A. Radvansky and Jeffrey M. Zacks. 2014. Event Cognition. Oxford University Press.

[224]

Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems. 2953--2961.

[225]

Giacomo Rizzolatti and Laila Craighero. 2004. The mirror-neuron system. Annu. Rev. Neurosci. 27 (2004), 169--192.

[226]

Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. 2014. Coherent multi-sentence video description with variable level of detail. In Pattern Recognition. Springer, 184--195.

[227]

Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. 2015. A dataset for movie description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3202--3212.

[228]

Stephen Roller and Sabine Schulte Im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1146--1157.

[229]

Sascha Rothe and Hinrich Schütze. 2015. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the ACL.

[230]

Sam T. Roweis and Lawrence K. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 5500 (2000), 2323--2326.

[231]

Deb Roy. 2005. Grounding words in perception and action: Computational insights. Trends Cogn. Sci. 9, 8 (2005), 390.

[232]

Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 77, 1--3 (2008), 157--173.

[233]

Fereshteh Sadeghi, C. Lawrence Zitnick, and Ali Farhadi. 2015. VISALOGY: Answering visual analogy questions. In Advances in Neural Information Processing Systems (NIPS-15).

[234]

Mohammad Amin Sadeghi and Ali Farhadi. 2011. Recognition using visual phrases. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1745--1752.

[235]

Karin Kipper Schuler. 2005. VerbNet: A broad-coverage, comprehensive verb lexicon (January 1, 2005). Dissertations available from ProQuest. Paper AAI3179808. http://repository.upenn.edu/dissertations/AAI3179808.

[236]

Roy Schwartz, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. CoNLL 2015 (2015), 258.

[237]

Nishant Shukla, Caiming Xiong, and Song-Chun Zhu. 2015. A unified framework for human-robot knowledge transfer. In Proceedings of the 2015 AAAI Fall Symposium Series.

[238]

Narayanaswamy Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2014. Seeing what you’re told: Sentence-guided activity recognition in video. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 732--739.

[239]

Carina Silberer, Vittorio Ferrari, and Mirella Lapata. 2013. Models of semantic representation with visual attributes. In ACL (1). 572--582.

[240]

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[241]

Bharat Singh, Xintong Han, Zhe Wu, Vlad I. Morariu, and Larry S. Davis. 2015. Selecting relevant web trained concepts for automated event retrieval. In Proceedings of the IEEE International Conference on Computer Vision. 4561--4569.

[242]

Jeffrey Mark Siskind. 2001. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. J. Artif. Intell. Res. 15 (2001), 31--90.

[243]

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling. 2 (2014), 207--218.

[244]

Richard Socher, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). 129--136.

[245]

Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep Boltzmann machines. J. Mach. Learn. Res. 15 (2014), 2949--2980.

[246]

Mark Steedman. 1996. Surface structure and interpretation. (1996).

[247]

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, and others. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems. 2440--2448.

[248]

Douglas Summers-Stay, Ching L. Teo, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2012. Using a minimal action grammar for activity understanding in the real world. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4104--4111.

[249]

Douglas Alan Summers-Stay. 2013. Productive vision: Methods for automatic image comprehension. (2013).

[250]

Yuyin Sun, Liefeng Bo, and Dieter Fox. 2013. Attribute based object identification. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2096--2103.

[251]

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1--9.

[252]

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.

[253]

Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. 2014. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1701--1708.

[254]

Makarand Tapaswi, Martin Bäuml, and Rainer Stiefelhagen. 2015. Book2movie: Aligning video scenes with book chapters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1827--1835.

[255]

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[256]

Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. 2005. Learning structured prediction models: A large margin approach. In Proceedings of the 22nd International Conference on Machine Learning. ACM, 896--903.

[257]

Stefanie Tellex, Ross Knepper, Adrian Li, Daniela Rus, and Nicholas Roy. 2014. Asking for help using inverse semantics. Proceedings of Robotics: Science and Systems, Berkeley, USA (2014).

[258]

Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R. Walter, Ashis Gopal Banerjee, Seth J. Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.

[259]

Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290, 5500 (2000), 2319--2323.

[260]

Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Fast 2D border ownership assignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5117--5125.

[261]

Ching L. Teo, Yezhou Yang, Hal Daumé III, Cornelia Fermüller, and Yiannis Aloimonos. 2012. Towards a Watson that sees: Language-guided action recognition for robots. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 374--381.

[262]

Jesse Thomason, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, and Raymond Mooney. 2014. Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the 25th International Conference on Computational Linguistics (COLING).

[263]

Jesse Thomason, Shiqi Zhang, Raymond Mooney, and Peter Stone. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI).

[264]

Sebastian Thrun, Wolfram Burgard, and Dieter Fox. 2005. Probabilistic Robotics. MIT Press.

[265]

Joseph Tighe and Svetlana Lazebnik. 2010. Superparsing: Scalable nonparametric image parsing with superpixels. In European Conference on Computer Vision. Springer, 352--365.

[266]

Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Using descriptive video services to create a large data source for video annotation research. arXiv preprint arXiv:1503.01070 (2015).

[267]

Antonio Torralba, Alexei Efros, and others. 2011. Unbiased look at dataset bias. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1521--1528.

[268]

Anne-Marie Tousch, Stéphane Herbin, and Jean-Yves Audibert. 2012. Semantic hierarchies for image annotation: A survey. Pattern Recogn. 45, 1 (2012), 333--345.

[269]

Zhuowen Tu, Xiangrong Chen, Alan L. Yuille, and Song-Chun Zhu. 2005. Image parsing: Unifying segmentation, detection, and recognition. Int. J. Comput. Vis. 63, 2 (2005), 113--140.

[270]

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 384--394.

[271]

Matthew Turk and Alex Pentland. 1991. Eigenfaces for recognition. J. Cogn. Neurosci. 3, 1 (1991), 71--86.

[272]

Jasper R. R. Uijlings and Vittorio Ferrari. 2015. Situational object boundary detection. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4712--4721.

[273]

Laurens J. P. van der Maaten, Eric O. Postma, and H. Jaap van den Herik. 2009. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 10, 1--41 (2009), 66--71.

[274]

Bernard Vauquois. 1968. Structures profondes et traduction automatique. Le système du CETA. Rev. Roum. Ling. 13, 2 (1968), 105--130.

[275]

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566--4575.

[276]

Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2015a. Sequence to sequence-video to text. In Proceedings of the IEEE International Conference on Computer Vision. 4534--4542.

[277]

Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. 2015b. Translating videos to natural language using deep recurrent neural networks. In NAACL HLT.

[278]

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3156--3164.

[279]

Luis Von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 319--326.

[280]

Matthew R. Walter, Matthew E. Antone, Ekapol Chuangsuwanich, Andrew Correa, Randall Davis, Luke Fletcher, Emilio Frazzoli, Yuli Friedman, James R. Glass, Jonathan P. How, Jeong Hwan Jeon, Sertac Karaman, Brandon Luders, Nicholas Roy, Stefanie Tellex, and Seth J. Teller. 2015. A situationally aware voice-commandable robotic forklift working alongside people in unstructured outdoor environments. J. Field Robot. 32, 4 (2015), 590--628.

[281]

Chong Wang, David Blei, and Fei-Fei Li. 2009. Simultaneous image classification and annotation. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, 1903--1910.

[282]

Meng Wang, Bingbing Ni, Xian-Sheng Hua, and Tat-Seng Chua. 2012. Assistive tagging: A survey of multimedia tagging with human-computer joint exploration. ACM Comput. Surv. 44, 4 (2012), 25.

[283]

Ronald J. Williams. 1988. On the use of backpropagation in associative reinforcement learning. In Proceedings of the IEEE International Conference on Neural Networks, 1988. IEEE, 263--270.

[284]

Qi Wu, Peng Wang, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2015b. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[285]

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 2015a. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1912--1920.

[286]

Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In ICML.

[287]

Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. 2015b. A multi-scale multiple instance video description network. arXiv preprint arXiv:1505.05914 (2015).

[288]

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. 2015a. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the 32nd International Conference on Machine Learning. 2048--2057.

[289]

Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2015c. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[290]

Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2013. Detection of manipulation action consequences (MAC). In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2563--2570.

[291]

Yezhou Yang, Cornelia Fermüller, Yiannis Aloimonos, and Eren Erdal Aksoy. 2015. Learning the semantics of manipulation action. In the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), Vol. 1. Association for Computational Linguistics, 676--686.

[292]

Yezhou Yang, Yi Li, Cornelia Fermüller, and Yiannis Aloimonos. 2015a. Neural self talk: Image understanding via continuous questioning and answering. arXiv preprint arXiv:1512.03460 (2015).

[293]

Yezhou Yang, Yi Li, Cornelia Fermüller, and Yiannis Aloimonos. 2015b. Robot learning manipulation action plans by “watching” unconstrained videos from the world wide web. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI-15).

[294]

Yezhou Yang, Ching Lik Teo, Hal Daumé III, and Yiannis Aloimonos. 2011. Corpus-guided sentence generation of natural images. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 444--454.

[295]

Yezhou Yang, Ching L. Teo, Cornelia Fermüller, and Yiannis Aloimonos. 2013. Robots with language: Multi-label visual recognition using NLP. In Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 4256--4262.

[296]

Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. 2011. Human action recognition by learning bases of action attributes and parts. In Proceedings of the 2011 International Conference on Computer Vision. IEEE, 1331--1338.

[297]

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. 2015. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision. 4507--4515.

[298]

Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[299]

Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[300]

Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A. Smith. 2014. Learning word representations with hierarchical sparse coding. arXiv preprint arXiv:1406.2035 (2014).

[301]

Nivasan Yogeswaran, Wenting Dang, William Taube Navaraj, Dhayalan Shakthivel, Saleem Khan, Emre Ozan Polat, Shoubhik Gupta, Hadi Heidari, Mohsen Kaboli, Leandro Lorenzelli, and others. 2015. New materials and advances in making electronic skin for interactive robots. Adv. Robot. 29, 21 (2015), 1359--1373.

[302]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Ling. 2 (2014), 67--78.

[303]

Haonan Yu, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. 2015b. A compositional framework for grounding language inference, generation, and acquisition in video. J. Artif. Intell. Res. (2015), 601--713.

[304]

Haonan Yu and Jeffrey Mark Siskind. 2013. Grounded language learning from video described with sentences. In ACL (1). 53--63.

[305]

Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. 2015c. Video paragraph captioning using hierarchical recurrent neural networks. arXiv preprint arXiv:1510.07712 (2015).

[306]

Licheng Yu, Eunbyung Park, Alexander C. Berg, and Tamara L. Berg. 2015a. Visual madlibs: Fill in the blank description generation and question answering. In Proceedings of the IEEE International Conference on Computer Vision. 2461--2469.

[307]

Xiaodong Yu, Cornelia Fermüller, Ching Lik Teo, Yezhou Yang, and Yiannis Aloimonos. 2011. Active scene recognition with vision and language. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV). IEEE, 810--817.

[308]

Konstantinos Zampogiannis, Yezhou Yang, Cornelia Fermüller, and Yiannis Aloimonos. 2015. Learning the spatial semantics of manipulation actions through preposition grounding. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 1389--1396.

[309]

John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence. 1050--1055.

[310]

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).

[311]

Dengsheng Zhang, Md Monirul Islam, and Guojun Lu. 2012. A review on automatic image annotation techniques. Pattern Recogn. 45, 1 (2012), 346--362.

[312]

Rong Zhao and William I. Grosky. 2002. Bridging the semantic gap in image retrieval. Distributed Multimedia Databases: Techniques and Applications (2002), 14--36.

[313]

Wenyi Zhao, Rama Chellappa, P. Jonathon Phillips, and Azriel Rosenfeld. 2003. Face recognition: A literature survey. ACM Comput. Surv. 35, 4 (2003), 399--458.

[314]

Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1529--1537.

[315]

Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7W: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[316]

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015a. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision. 19--27.

[317]

Yuke Zhu, Ce Zhang, Christopher Ré, and Li Fei-Fei. 2015b. Building a large-scale multimodal knowledge base for visual question answering. arXiv preprint arXiv:1507.05670 (2015).

Index Terms

Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics
1. Computer systems organization
  1. Embedded and cyber-physical systems
    1. Robotics
2. Computing methodologies
  1. Artificial intelligence
  2. Machine learning

Published In

ACM Computing Surveys Volume 49, Issue 4

December 2017

666 pages

ISSN:0360-0300

EISSN:1557-7341

DOI:10.1145/3022634

Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 December 2016

Accepted: 01 October 2016

Revised: 01 July 2016

Received: 01 February 2016

Published in CSUR Volume 49, Issue 4

Qualifiers

Survey
Research
Refereed
