Computational linguistics for metadata building (CLiMB): using text mining for the automatic identification, categorization, and disambiguation of subject terms for image metadata

Klavans, Judith L.; Sheffield, Carolyn; Abels, Eileen; Lin, Jimmy; Passonneau, Rebecca; Sidhu, Tandeep; Soergel, Dagobert

doi:10.1007/s11042-008-0253-9

Published: 08 November 2008

Multimedia Tools and Applications, Volume 42, pages 115–138 (2009)

Judith L. Klavans^1,2,3,
Carolyn Sheffield^1,
Eileen Abels^4,
Jimmy Lin^1,2,3,
Rebecca Passonneau^5,
Tandeep Sidhu^1 &
Dagobert Soergel^1

Abstract

In this paper, we present a system that uses computational linguistic techniques to extract metadata for image access. We discuss the implementation, functionality, and evaluation of an image catalogers’ toolkit developed in the Computational Linguistics for Metadata Building (CLiMB) research project. We have tested components of the system, including phrase finding for the art and architecture domain, functional semantic labeling using machine learning, and disambiguation of terms in domain-specific text vis-à-vis a rich thesaurus of subject terms, geographic names, and artist names. We present specific results on disambiguation techniques and on the nature of the ambiguity problem given the thesaurus, resources, and domain-specific text resource, with a comparison of domain-general resources and text. Our primary user group for evaluation has been expert catalogers with specific expertise in the fields of painting, sculpture, and vernacular and landscape architecture.

Notes

Some examples include OntoImage’2006, the First International “Language Resources for Content-Based Image Retrieval” Workshop, held in conjunction with the Language Resources and Evaluation Conference (LREC) 2006, http://www.lrec-conf.org/lrec2006; OntoImage’2008, the Second International “Language Resources for Content-Based Image Retrieval” Workshop, held in conjunction with LREC’2008, http://www.dfki.de/~declerck/ontoimage.html; and workshops on computational linguistics for image access held at the Visual Resources Association annual meetings in 2006, 2007, and 2008, http://www.vraweb.org.
http://vraweb.org/ccoweb/cco/parttwo_chapter6.html.
One such project, T³: Text, Tagging and Trust to Improve Image Access for Museums and Libraries, has just been funded by the Institute of Museum and Library Services, imls.gov.
Some metadata standards mentioned in Baca 2003 were: Categories for the Description of Works of Art (CDWA) from the Getty Research Institute and Cataloging Cultural Objects (CCO) from the Visual Resources Association.
Notable controlled vocabularies noted in Baca 2003 were: Library of Congress Subject Headings; Library of Congress Name Authority File; the Getty Vocabularies; Thesaurus for Graphic Materials I and II.
http://www.vernaculararchitectureforum.org/.
http://www.sah.org/.
http://www.lair.umd.edu/.
http://www.artstor.org.
Both the tagger and parser are available at: http://nlp.stanford.edu/software.
Lucene is a search engine library: http://lucene.apache.org.
Getty resources can be accessed at: http://getty.edu/research/conducting_research/vocabularies/aat.
According to the documentation on the TGN, natural order refers to searching on the most common order of a name, e.g. Al-Hoceima, whereas inverted order would be Hoceima, Al-.
Steve: The Museum Social Tagging Project. http://www.steve.museum.
Luis von Ahn: The ESP Game at Games with a Purpose (GWAP). http://www.gwap.com/gwap/gamesPreview/espgame/.
Jennifer Golbeck: FilmTrust. http://www.mindswap.org.

References

Anderson JD, Perez-Carballo J (2001) The nature of indexing: how humans and machines analyze messages and texts for retrieval. Part I: research, and the nature of human indexing. Inf Process Manag 37:231–254
Anderson JD, Perez-Carballo J (2001) The nature of indexing: how humans and machines analyze messages and texts for retrieval—part II: machine indexing, and the allocation of human versus machine effort. Inf Process Manag 37:255–277
Baca M (2003) Practical issues in applying metadata schemas and controlled vocabularies to cultural heritage information. Cat Classif Q 36(3/4):47–55
Banerjee S, Pedersen T (2003) Extended gloss overlaps as a measure of semantic relatedness. Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pp 805–810
Barnard K, Forsyth DA (2001) Learning the semantics of words and pictures. Proceedings of International Conference on Computer Vision, pp 408–415
Brill E (1995) Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Comput Linguist 21(4):543–565
Charniak E (1997) Statistical techniques for natural language parsing. AI Mag 18(4):33–44
Chen H (2001) An analysis of image retrieval tasks in the field of art history. Inf Process Manag 37:701–720
Choi Y, Rasmussen E (2003) Searching for images: the analysis of users’ queries for image retrieval in American history. J Am Soc Inf Sci Technol 54:498–511
Church KW (1988) A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, Austin, Texas, 9–12 February, pp 136–143
Collins K (1998) Providing subject access to images: a study of user queries. Am Arch 61:36–55
Datta R, Joshi D, Li J, Wang JZ (2008) Image retrieval: ideas, influences, and trends of the new age. ACM Comput Surv 40(2):5–60
Demner-Fushman D (2008) Combining medical domain ontological knowledge and low-level image features for multimedia indexing. OntoImage 2008: 2nd International Language Resources for Content-Based Image Retrieval Workshop in conjunction with LREC’2008, pp 18–23
Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT, Cambridge, MA
Gale W, Church K, Yarowsky D (1993) A method for disambiguating word senses in a large corpus. Computers and the Humanities 26:415–439
Grishman R, Sundheim B (Eds) (1995) Design of the MUC-6 evaluation. Sixth Message Understanding Conference (MUC-6), NIST, Morgan-Kaufmann, Columbia, MD, pp 1–11
Hatzivassiloglou V, Klavans JL, Eskin E (1999) Detecting text similarity over short passages: exploring linguistic feature combinations via machine learning. Proceedings of Empirical Methods in Natural Language Processing (EMNLP) and Very Large Corpora, MD, USA, pp 203–212
Hatzivassiloglou V, Gravano L, Maganti A (2000) An investigation of linguistic features and clustering algorithms for topical document clustering. Proceedings of the Annual Meeting of ACM-SIGIR, pp 224–231
Hearst M (1997) TextTiling: segmenting text into multi-paragraph subtopic passages. Comput Linguist 23(1):33–64
Kan M, Klavans JL, McKeown KR (1998) Linear segmentation and segment relevance. Proceedings of the 6th International Workshop of Very Large Corpora (WVLC-6), Montréal, Québec, Canada, pp 197–205
Keister LH (1994) User types and queries: impact on image access systems. In: Fidel R, Hahn TB, Rasmussen E, Smith PJ (eds) Challenges in indexing electronic text and images. Learned Information for the American Society of Information Science, Medford, pp 7–22
Klavans JL, Chodorow MS, Wacholder N (1990) From dictionary to knowledge base via taxonomy. Proceedings of the sixth conference of the University of Waterloo Centre for the New Oxford English Dictionary and Text Research: Electronic Text Research, University of Waterloo, Waterloo, Canada, pp 110–132
Klavans JL, Tzoukermann E (1996) Dictionaries and corpora: combining corpus and machine-readable dictionary data for building bilingual lexicons. Journal of Machine Translation 10(3–4):185–218
Klein S, Simmons RF (1963) A computational approach to grammatical coding of English words. J Assoc Comput Mach 10(3):334–347
Lesk M (1986) Automatic sense disambiguation: how to tell a pine cone from an ice cream cone. Proceedings of the 1986 ACM SIGDOC Conference, pp 24–26
Lew MS (2000) Next-generation web searches for visual content. IEEE Computer 33:46–53
Maron ME (1961) Automatic indexing: an experimental inquiry. J Assoc Comput Mach 8(3):404–417
Palmer M, Ng HT, Dang HT (2006) Evaluation. In: Edmonds P, Agirre E (eds) Word sense disambiguation: algorithms, applications, and trends. Text, Speech, and Language Technology series. Kluwer, The Netherlands
Panofsky E (1962) Studies in iconology: humanistic themes in the art of the renaissance. Harper & Row, New York
Passonneau R, Yano T, Lippincott T, Klavans J (2008) Functional semantic categories for art history text: human labeling and preliminary machine learning. Proceedings of the 3rd International Conference on Computer Vision Theory and Applications, Workshop on Metadata Mining for Image Understanding, pp 13–22
Pastra K, Saggion H, Wilks Y (2003) Intelligent indexing of crime-scene photographs. IEEE Intell Syst Their Appl 18(1):55–61
Patwardhan S, Banerjee S, Pedersen T (2003) Using measures of semantic relatedness for word sense disambiguation. Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, pp 241–257
Rasmussen EM (1997) Indexing images. Annu Rev Inf Sci Technol 32:169–196
Resnik P (1999) Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 11:95–130
Rorissa A, Iyer H (2008) Theories of cognition and image categorization: what category labels reveal about basic level theory. J Am Soc Inf Sci Technol 59(9):1383–1392
Shatford S (1986) Analyzing the subject of a picture: a theoretical approach. Cat Classif Q 6(3):39–62
Sidhu T, Klavans JL, Lin J (2007) Concept disambiguation for improved subject access using multiple knowledge sources. Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTech 2007), 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, pp 25–32
Tibbo HR (1994) Indexing for the humanities. J Am Soc Inf Sci 45(8):607–619
Wilks Y, Catizone R (2002) What is lexical tuning? J Semant 19(2):167–190
Yang Y, Liu X (1999) A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR, pp 42–49
Yarowsky D (1994) Decision lists for lexical ambiguity resolution. Proceedings of ACL-94, Las Cruces, NM, pp 88–95
Yarowsky D (1992) Word-sense disambiguation using statistical models of Roget’s categories trained on large corpora. Proceedings of COLING’92 Conference, pp 454–460

Acknowledgements

We acknowledge the Program Office for Scholarly Communications of the Andrew W. Mellon Foundation, especially Don Waters and Suzanne Lodato; Dr. Murtha Baca, director of the Getty Vocabulary Program and Digital Resource Management, Getty Research Institute, for providing us with research access to resources; cataloging and domain expert Angela Giral; collections partners, including Jeff Cohen, Bryn Mawr College and University of Pennsylvania, for the vernacular architecture collection; Jack Sullivan, University of Maryland, for landscape architecture; the Senate Museum and Library; and ARTstor. Finally, Joan Beaudoin (Drexel), Laura Jaeneman (Drexel), and Brooke Rosenblatt (the Phillips Gallery) helped with annotation, collections, and user studies.

Author information

Authors and Affiliations

iSchool, University of Maryland, College Park, MD, USA
Judith L. Klavans, Carolyn Sheffield, Jimmy Lin, Tandeep Sidhu & Dagobert Soergel
Computational Linguistics and Information Processing Laboratory (CLIP), University of Maryland, College Park, MD, USA
Judith L. Klavans & Jimmy Lin
University of Maryland Institute for Advanced Computer Science (UMIACS), University of Maryland, College Park, MD, USA
Judith L. Klavans & Jimmy Lin
College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
Eileen Abels
Center for Computational Learning Systems, Columbia University, New York, NY, USA
Rebecca Passonneau

Corresponding author

Correspondence to Carolyn Sheffield.

Additional information

This project, funded by the Andrew W. Mellon Foundation, was initiated at the Center for Research on Information Access at Columbia University and is currently based at the University of Maryland.

About this article

Cite this article

Klavans, J.L., Sheffield, C., Abels, E. et al. Computational linguistics for metadata building (CLiMB): using text mining for the automatic identification, categorization, and disambiguation of subject terms for image metadata. Multimed Tools Appl 42, 115–138 (2009). https://doi.org/10.1007/s11042-008-0253-9

Published: 08 November 2008
Issue Date: March 2009
DOI: https://doi.org/10.1007/s11042-008-0253-9
