Abstract
This research explores the interaction of textual and photographic information in image understanding. Specifically, it presents a computational model whereby textual captions are used as collateral information in the interpretation of the corresponding photographs. The final understanding of the picture and caption reflects a consolidation of the information obtained from each of the two sources and can thus be used in intelligent information-retrieval tasks. Building a general-purpose computer vision system without a priori knowledge is very difficult at best. The concept of using collateral information in scene understanding has been explored in systems that use general scene context in the task of object identification. The work described here extends this notion by incorporating picture-specific information. A multi-stage system, PICTION, which uses captions to identify humans in an accompanying photograph, is described. This provides a computationally less expensive alternative to traditional methods of face recognition. A key component of the system is the utilization of spatial and characteristic constraints (derived from the caption) in labeling face candidates (generated by a face locator).
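The constraint-labeling step described above can be illustrated with a minimal sketch: given face candidates from a face locator and spatial constraints derived from a caption, search for a one-to-one assignment of names to candidates that satisfies every constraint. This is not the original PICTION implementation; the names, candidate coordinates, and constraints below are hypothetical, and the exhaustive search stands in for whatever constraint-satisfaction strategy the system actually uses.

```python
# Illustrative sketch of caption-driven face labeling as constraint
# satisfaction. Hypothetical data throughout.
from itertools import permutations

# Face candidates: (x, y) centroids, as might be returned by a face locator.
candidates = {"c1": (40, 50), "c2": (120, 55), "c3": (200, 60)}

def left_of(a, b):
    """Spatial constraint: candidate a lies left of candidate b."""
    return candidates[a][0] < candidates[b][0]

# Constraints as they might be derived from a caption such as
# "Jones, left, with Smith and Lee" (hypothetical example).
names = ["Jones", "Smith", "Lee"]
constraints = [
    ("Jones", "Smith", left_of),   # Jones appears left of Smith
    ("Smith", "Lee", left_of),     # Smith appears left of Lee
]

def consistent(assignment):
    """Check every constraint whose two names are both assigned."""
    return all(rel(assignment[a], assignment[b])
               for a, b, rel in constraints)

def label_faces():
    # Try each one-to-one mapping of names onto face candidates and
    # return the first mapping that satisfies all constraints.
    for perm in permutations(candidates, len(names)):
        assignment = dict(zip(names, perm))
        if consistent(assignment):
            return assignment
    return None

print(label_faces())
```

Brute-force enumeration is exponential in the number of names, which is why constraint-propagation techniques (e.g. arc consistency) are the usual choice for pruning such search spaces.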
Acknowledgements
This work was supported in part by ARPA Contract 93-F148900-000. I would like to thank William Rapaport for serving as my advisor in my doctoral work; Venu Govindaraju for his work on the face locator; and, more recently, Rajiv Chopra, Debra Burhans and Toshio Morita for their work on the new implementation of PICTION as well as for valuable feedback.
Cite this article
Srihari, R.K. Use of captions and other collateral text in understanding photographs. Artif Intell Rev 8, 409–430 (1994). https://doi.org/10.1007/BF00849728