Abstract
This research explores the interaction of textual and photographic information in image understanding. Specifically, it presents a computational model whereby textual captions are used as collateral information in the interpretation of the corresponding photographs. The final understanding of the picture and caption reflects a consolidation of the information obtained from each of the two sources and can thus be used in intelligent information-retrieval tasks. Building a general-purpose computer vision system without a priori knowledge is very difficult at best. The concept of using collateral information in scene understanding has been explored in systems that use general scene context in the task of object identification. The work described here extends this notion by incorporating picture-specific information. A multi-stage system, PICTION, which uses captions to identify humans in an accompanying photograph, is described. This provides a computationally less expensive alternative to traditional methods of face recognition. A key component of the system is the utilization of spatial and characteristic constraints (derived from the caption) in labeling face candidates (generated by a face locator).
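The constraint-labeling step described above can be illustrated with a minimal sketch: given face candidates from a face locator and spatial constraints derived from a caption, search for a one-to-one assignment of names to candidates that satisfies every constraint. This is not the original PICTION implementation; the names, candidate coordinates, and constraints below are hypothetical, and the exhaustive search stands in for whatever constraint-satisfaction strategy the system actually uses.

```python
# Illustrative sketch of caption-driven face labeling as constraint
# satisfaction. Hypothetical data throughout.
from itertools import permutations

# Face candidates: (x, y) centroids, as might be returned by a face locator.
candidates = {"c1": (40, 50), "c2": (120, 55), "c3": (200, 60)}

def left_of(a, b):
    """Spatial constraint: candidate a lies left of candidate b."""
    return candidates[a][0] < candidates[b][0]

# Constraints as they might be derived from a caption such as
# "Jones, left, with Smith and Lee" (hypothetical example).
names = ["Jones", "Smith", "Lee"]
constraints = [
    ("Jones", "Smith", left_of),   # Jones appears left of Smith
    ("Smith", "Lee", left_of),     # Smith appears left of Lee
]

def consistent(assignment):
    """Check every constraint whose two names are both assigned."""
    return all(rel(assignment[a], assignment[b])
               for a, b, rel in constraints)

def label_faces():
    # Try each one-to-one mapping of names onto face candidates and
    # return the first mapping that satisfies all constraints.
    for perm in permutations(candidates, len(names)):
        assignment = dict(zip(names, perm))
        if consistent(assignment):
            return assignment
    return None

print(label_faces())
```

Brute-force enumeration is exponential in the number of names, which is why constraint-propagation techniques (e.g. arc consistency) are the usual choice for pruning such search spaces.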
Acknowledgements
This work was supported in part by ARPA Contract 93-F148900-000. I would like to thank William Rapaport for serving as my advisor in my doctoral work; Venu Govindaraju for his work on the face locator; and, more recently, Rajiv Chopra, Debra Burhans and Toshio Morita for their work on the new implementation of PICTION as well as for valuable feedback.
Cite this article
Srihari, R.K. Use of captions and other collateral text in understanding photographs. Artif Intell Rev 8, 409–430 (1994). https://doi.org/10.1007/BF00849728