Abstract
In this paper, we propose a new method that models temporal context to boost video annotation accuracy. Our motivation comes from the observation that temporally adjacent shots in a video tend to share related content, so annotation performance can be improved by mining the temporal dependency between shots. Based on this observation, we propose a temporal context model that exploits the redundant information shared by neighboring shots. By connecting our model to conditional random fields and borrowing their learning and inference procedures, we obtain a refined probability of each concept occurring in a shot, which fuses the temporal context information with the initial output of the video annotation system. Compared with existing methods that mine temporal context for video annotation, our model captures different kinds of shot dependency more accurately and thus improves annotation performance. Furthermore, the model is simple and efficient, which matters for applications that must process large-scale data. Extensive experiments on the widely used TRECVID datasets demonstrate the effectiveness of our method in improving video annotation accuracy.
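The core idea of refining initial per-shot concept probabilities with a CRF over the temporal shot sequence can be illustrated with a minimal sketch. This is not the authors' actual model: it assumes a two-state (concept absent/present) linear-chain CRF with a single hand-set pairwise weight `w` that rewards temporally adjacent shots sharing the same label, and computes exact posterior marginals with the forward-backward algorithm; the function name and parameters are hypothetical.

```python
import math


def _logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over an iterable."""
    xs = list(xs)
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def refine_shot_scores(p_init, w=2.0):
    """Refine per-shot concept probabilities with a two-state linear-chain
    CRF-style model (illustrative sketch, not the paper's model).

    p_init : list of initial detector probabilities, one per shot.
    w      : pairwise weight rewarding equal labels on adjacent shots.
    Returns the posterior marginal P(concept present) for each shot.
    """
    n = len(p_init)
    eps = 1e-6
    # Unary log-potentials taken directly from the initial detector output.
    unary = [(math.log(max(1.0 - p, eps)), math.log(max(p, eps))) for p in p_init]
    # Pairwise log-potential: bonus w when neighboring shots agree.
    pair = [[w, 0.0], [0.0, w]]

    # Forward pass in log-space.
    alpha = [list(unary[0])]
    for t in range(1, n):
        prev = alpha[-1]
        alpha.append([
            unary[t][s] + _logsumexp(prev[r] + pair[r][s] for r in (0, 1))
            for s in (0, 1)
        ])

    # Backward pass in log-space.
    beta = [[0.0, 0.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            _logsumexp(unary[t + 1][s] + pair[r][s] + beta[t + 1][s] for s in (0, 1))
            for r in (0, 1)
        ]

    # Posterior marginal of "concept present" at each shot.
    refined = []
    for t in range(n):
        l0 = alpha[t][0] + beta[t][0]
        l1 = alpha[t][1] + beta[t][1]
        refined.append(1.0 / (1.0 + math.exp(l0 - l1)))
    return refined
```

With this toy model, an isolated high detector score surrounded by low-scoring shots is pulled down, while a run of consistently high scores is reinforced, which is the qualitative behavior the temporal-context refinement aims for.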
Cite this article
Yi, J., Peng, Y. & Xiao, J. A temporal context model for boosting video annotation. Sci. China Inf. Sci. 56, 1–14 (2013). https://doi.org/10.1007/s11432-012-4720-6