Abstract
Research in vision and language has traditionally remained separate, in part because the classic task of generating a representation of a given image or sentence has resulted in an emphasis on low-level structural aspects of these media. In this paper we argue that image and language understanding should be approached with the intent of facilitating the performance of a task. Under this view, research in image and language understanding must confront common issues that arise as a task is pursued. Language and images are both inputs that can be used to maintain a model of a task. We argue that a model may be maintained by incorporating changes in the scene that can be characterized at a high level of abstraction yet manifest themselves at relatively low levels of analysis. Existing task-relevant models and the associated domain knowledge are used to expect specific changes and to disambiguate the interpretation of these changes, thereby allowing them to modify the existing model. From this perspective, understanding input is largely independent of the modality of the input.
Cite this article
Schank, R.C., Fano, A. Memory and expectations in learning, language, and visual understanding. Artif Intell Rev 9, 261–271 (1995). https://doi.org/10.1007/BF00849039
DOI: https://doi.org/10.1007/BF00849039