
Use of captions and other collateral text in understanding photographs

Published in: Artificial Intelligence Review

Abstract

This research explores the interaction of textual and photographic information in image understanding. Specifically, it presents a computational model whereby textual captions are used as collateral information in the interpretation of the corresponding photographs. The final understanding of the picture and caption reflects a consolidation of the information obtained from each of the two sources and can thus be used in intelligent information-retrieval tasks. Building a general-purpose computer vision system without a priori knowledge is extremely difficult. The concept of using collateral information in scene understanding has been explored in systems that use general scene context in the task of object identification. The work described here extends this notion by incorporating picture-specific information. A multi-stage system, PICTION, which uses captions to identify humans in an accompanying photograph, is described. This provides a computationally less expensive alternative to traditional methods of face recognition. A key component of the system is the utilisation of spatial and characteristic constraints (derived from the caption) in labeling face candidates (generated by a face locator).



Additional information

This work was supported in part by ARPA Contract 93-F148900-000. I would like to thank William Rapaport for serving as my advisor in my doctoral work; Venu Govindaraju for his work on the face locator; and more recently, Rajiv Chopra, Debra Burhans and Toshio Morita for their work in the new implementation of PICTION as well as valuable feedback.

Cite this article

Srihari, R.K. Use of captions and other collateral text in understanding photographs. Artif Intell Rev 8, 409–430 (1994). https://doi.org/10.1007/BF00849728
