Abstract
This paper presents an approach to designing and implementing extensible computational models for perceiving systems, based on knowledge-driven joint inference. These models can integrate different sources of information both horizontally (multi-modal and temporal fusion) and vertically (bottom-up and top-down) by incorporating prior hierarchical knowledge expressed as an extensible ontology.
Two implementations of this approach are presented. The first is a content-based image retrieval system that allows users to search image databases using an ontological query language. Queries are parsed using a probabilistic grammar and Bayesian networks to map high-level concepts onto low-level image descriptors, thereby bridging the 'semantic gap' between users and the retrieval system. The second application extends the notion of ontological languages to video event detection. It shows how effective high-level state and event recognition mechanisms can be learned from a set of annotated training sequences by incorporating syntactic and semantic constraints represented by an ontology.
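The idea of using a Bayesian network to relate a high-level concept to low-level image descriptors can be illustrated with a minimal sketch. This is not the paper's model: it assumes the simplest possible structure (one concept node with two conditionally independent descriptor nodes), and every concept name, descriptor value, and probability below is a hypothetical placeholder chosen for illustration.

```python
# Illustrative only: the concepts, descriptor values, and probabilities
# below are made-up assumptions, not values taken from the paper.

# Prior belief over high-level concepts for an image region.
PRIOR = {"sky": 0.3, "grass": 0.3, "other": 0.4}

# P(descriptor value | concept) for two discretised low-level descriptors.
LIKELIHOOD = {
    "colour": {
        "sky":   {"blue": 0.80, "green": 0.05, "grey": 0.15},
        "grass": {"blue": 0.05, "green": 0.85, "grey": 0.10},
        "other": {"blue": 0.30, "green": 0.30, "grey": 0.40},
    },
    "texture": {
        "sky":   {"smooth": 0.9, "rough": 0.1},
        "grass": {"smooth": 0.2, "rough": 0.8},
        "other": {"smooth": 0.5, "rough": 0.5},
    },
}


def concept_posterior(evidence):
    """Return P(concept | evidence) by exact enumeration.

    Descriptors are assumed conditionally independent given the concept,
    so the network is a two-layer (naive Bayes) structure.
    """
    joint = {}
    for concept, prior in PRIOR.items():
        p = prior
        for descriptor, value in evidence.items():
            p *= LIKELIHOOD[descriptor][concept][value]
        joint[concept] = p
    total = sum(joint.values())
    return {concept: p / total for concept, p in joint.items()}


if __name__ == "__main__":
    posterior = concept_posterior({"colour": "blue", "texture": "smooth"})
    # Print concepts from most to least probable.
    for concept, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"{concept}: {p:.3f}")
```

Under these assumed numbers, a blue, smooth region is assigned most of its posterior mass to the hypothetical "sky" concept. A system of the kind described would replace the hand-set tables with learned parameters and a richer, ontology-derived network structure.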
Author information
Dr Christopher Town is a Research Fellow at Wolfson College in the University of Cambridge. He completed undergraduate and doctoral studies in computer science at Trinity College, Cambridge, where he was a Senior Scholar, ProJuvis Scholar, IEE Scholar, Research Scholar, and Rouse Ball Scholar. During his PhD he was sponsored by AT&T Labs Research through an Industrial Fellowship from the Royal Commission for the Exhibition of 1851. His PhD thesis was awarded a prize in the 2005 Distinguished Dissertation Competition by the UK Conference of Professors and Heads of Computing in conjunction with the British Computer Society. He published 15 papers during his doctoral research and was awarded a best paper prize at the International Conference on Vision Systems in 2003. Prior to starting his PhD, he carried out research at AT&T Labs in Cambridge and in the USA. Dr Town's main research interests are in computer vision, particularly the ways in which methods from information retrieval and machine learning can be applied to vision problems such as content-based image retrieval and classification.
Cite this article
Town, C. Ontological inference for image and video analysis. Machine Vision and Applications 17, 94–115 (2006). https://doi.org/10.1007/s00138-006-0017-3