A temporal context model for boosting video annotation

  • Research Paper
  • Progress of Projects Supported by NSFC

Science China Information Sciences

Abstract

In this paper, we propose a new method that models temporal context to boost video annotation accuracy. Our idea is motivated by the fact that temporally continuous shots in a video generally have related content, so annotation performance can be considerably boosted by mining the temporal dependency between shots. Based on this observation, we propose a temporal context model that mines the redundant information between shots. By connecting our model with a conditional random field and borrowing its learning and inference approaches, we obtain a refined probability of a concept occurring in a shot, which leverages both the temporal context information and the initial output of video annotation. Compared with existing methods that mine temporal context for video annotation, our model captures different kinds of shot dependency more accurately, thereby improving annotation performance. Furthermore, our model is relatively simple and efficient, which is important for applications that must process large-scale data. Extensive experimental results on the widely used TRECVID datasets demonstrate the effectiveness of our method in improving video annotation accuracy.
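The paper's full CRF formulation is only available in the subscription article, but the idea the abstract sketches, namely refining each shot's initial concept probability by propagating evidence between temporally adjacent shots along a chain, can be illustrated with a minimal sketch. The snippet below is an assumption-laden stand-in rather than the authors' model: it smooths per-shot detector scores with forward-backward message passing on a two-state (concept absent/present) chain, where the stay transition parameter is a hypothetical constant chosen for illustration, not a learned CRF weight.

    import numpy as np

    def refine_shot_scores(p_init, stay=0.8):
        """Smooth per-shot concept probabilities over a temporal chain.

        p_init : 1-D array of initial detector probabilities, one per shot.
        stay   : assumed probability that the concept label persists between
                 adjacent shots (hypothetical value, not from the paper).

        Returns refined posteriors computed by forward-backward message
        passing on a two-state (absent/present) chain.
        """
        T = len(p_init)
        # Per-shot evidence for states [absent, present], shape (T, 2).
        emit = np.stack([1.0 - p_init, p_init], axis=1)
        # Transition potentials favoring temporal consistency.
        trans = np.array([[stay, 1 - stay],
                          [1 - stay, stay]])

        # Forward pass (normalized at each step to avoid underflow).
        alpha = np.zeros((T, 2))
        alpha[0] = emit[0] / emit[0].sum()
        for t in range(1, T):
            a = (alpha[t - 1] @ trans) * emit[t]
            alpha[t] = a / a.sum()

        # Backward pass.
        beta = np.ones((T, 2))
        for t in range(T - 2, -1, -1):
            b = trans @ (emit[t + 1] * beta[t + 1])
            beta[t] = b / b.sum()

        # Combine messages into per-shot posteriors.
        post = alpha * beta
        post /= post.sum(axis=1, keepdims=True)
        return post[:, 1]  # refined probability that the concept is present

    # Example: a noisy dip inside a run of positive shots is pulled up
    # toward its temporal neighbors, the qualitative effect the paper targets.
    scores = np.array([0.9, 0.85, 0.3, 0.88, 0.9])
    print(refine_shot_scores(scores))

A real CRF-based refinement would replace the fixed transition constant with learned pairwise potentials conditioned on shot features, but the chain structure and the message-passing inference shown here are the same basic machinery.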

Author information

Corresponding author

Correspondence to YuXin Peng.

About this article

Cite this article

Yi, J., Peng, Y. & Xiao, J. A temporal context model for boosting video annotation. Sci. China Inf. Sci. 56, 1–14 (2013). https://doi.org/10.1007/s11432-012-4720-6
