Ontological inference for image and video analysis

  • Original Paper
  • Published in: Machine Vision and Applications (2006)

Abstract

This paper presents a knowledge-driven joint inference approach to designing and implementing extensible computational models for perceiving systems. These models can integrate different sources of information both horizontally (multi-modal and temporal fusion) and vertically (bottom-up and top-down inference) by incorporating prior hierarchical knowledge expressed as an extensible ontology.
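As a rough illustration of this kind of ontology-driven fusion, the following Python sketch combines per-modality evidence (horizontal fusion) with a prior and child-concept support (vertical fusion) at a single concept node. It is a minimal toy under stated assumptions, not the paper's implementation: the Concept class, the likelihood-ratio fusion rule, and the heuristic child-support term are all invented for illustration.

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class Concept:
        """A node in a toy ontology; evidence is fused as likelihood ratios."""
        name: str
        prior: float = 0.5                      # top-down prior belief for the concept
        parent: Optional["Concept"] = None
        children: List["Concept"] = field(default_factory=list)
        modality_likelihoods: Dict[str, float] = field(default_factory=dict)

        def add_child(self, child: "Concept") -> "Concept":
            child.parent = self
            self.children.append(child)
            return child

        def belief(self) -> float:
            # Horizontal fusion: multiply likelihood ratios contributed by each
            # modality (colour, motion, ...) under a naive independence assumption.
            odds = self.prior / (1.0 - self.prior)
            for ratio in self.modality_likelihoods.values():
                odds *= ratio
            # Vertical (bottom-up) support: a crude heuristic boost from confident
            # child concepts; a real system would use proper Bayesian message passing.
            for child in self.children:
                odds *= 0.5 + child.belief()
            return odds / (1.0 + odds)

    # Usage: a "vehicle" concept supported by colour and motion cues and by a
    # detected "wheel" sub-concept.
    vehicle = Concept("vehicle", prior=0.2)
    vehicle.modality_likelihoods = {"colour": 3.0, "motion": 2.5}
    wheel = vehicle.add_child(Concept("wheel", prior=0.3))
    wheel.modality_likelihoods = {"shape": 4.0}
    print(f"P(vehicle | evidence) = {vehicle.belief():.2f}")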

Two implementations of this approach are presented. The first consists of a content-based image retrieval system that allows users to search image databases using an ontological query language. Queries are parsed using a probabilistic grammar and Bayesian networks to map high-level concepts onto low-level image descriptors, thereby bridging the ‘semantic gap’ between users and the retrieval system. The second application extends the notion of ontological languages to video event detection. It is shown how effective high-level state and event recognition mechanisms can be learned from a set of annotated training sequences by incorporating syntactic and semantic constraints represented by an ontology.
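As a toy illustration of the concept-to-descriptor mapping that such a retrieval system performs, the following Python sketch scores a segmented region against two query concepts with a naive-Bayes model over low-level descriptors. It is not the paper's implementation: the descriptor names, the Gaussian concept models, and the uniform prior are assumptions made purely for illustration; the actual system uses Bayesian networks over richer descriptors.

    import math
    from typing import Dict, Tuple

    # Hypothetical per-concept Gaussian models over two simple region descriptors
    # (mean hue, texture energy), as might be learned from labelled example regions.
    CONCEPT_MODELS: Dict[str, Dict[str, Tuple[float, float]]] = {
        "grass": {"hue": (0.30, 0.05), "texture": (0.70, 0.15)},
        "sky":   {"hue": (0.60, 0.05), "texture": (0.10, 0.10)},
    }

    def log_gaussian(x: float, mean: float, std: float) -> float:
        """Log density of a one-dimensional Gaussian."""
        return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

    def concept_posteriors(region: Dict[str, float]) -> Dict[str, float]:
        """Posterior over query concepts for one segmented region, assuming a
        uniform concept prior and conditionally independent descriptors (naive Bayes)."""
        log_scores = {
            concept: sum(log_gaussian(region[f], m, s) for f, (m, s) in feats.items())
            for concept, feats in CONCEPT_MODELS.items()
        }
        shift = max(log_scores.values())        # subtract max for numerical stability
        scores = {c: math.exp(v - shift) for c, v in log_scores.items()}
        total = sum(scores.values())
        return {c: v / total for c, v in scores.items()}

    # Usage: a greenish, strongly textured region scores highly for "grass", so a
    # query containing "grass" would favour images containing such regions.
    print(concept_posteriors({"hue": 0.32, "texture": 0.65}))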

Author information

Corresponding author

Correspondence to Christopher Town.

Additional information

Dr Christopher Town is a Research Fellow at Wolfson College in the University of Cambridge. He completed undergraduate and doctoral studies in computer science at Trinity College, Cambridge, where he was a Senior Scholar, ProJuvis Scholar, IEE Scholar, Research Scholar, and Rouse Ball Scholar. During his PhD he was sponsored by AT&T Labs Research through an Industrial Fellowship from the Royal Commission for the Exhibition of 1851. His PhD thesis was awarded a prize in the 2005 Distinguished Dissertation Competition by the UK Conference of Professors and Heads of Computing in conjunction with the British Computer Society. He published 15 papers during his doctoral research and was awarded a best paper prize at the International Conference on Vision Systems in 2003. Prior to starting his PhD, he carried out research at AT&T Labs in Cambridge and in the USA. Dr Town's main research interests are in computer vision, particularly the way in which methods from information retrieval and machine learning can be applied to solve vision problems such as content-based image retrieval and classification.

Cite this article

Town, C. Ontological inference for image and video analysis. Machine Vision and Applications 17, 94–115 (2006). https://doi.org/10.1007/s00138-006-0017-3
