Abstract
Online social video websites such as YouTube allow users to manually annotate their videos with textual labels. These labels can be used as indexing keywords to facilitate search and organization of video data. However, manual video annotation is usually a labor-intensive and time-consuming process. In this work, we propose a novel social video annotation approach that combines multiple feature sets through a tri-adaptation approach. The shots in each video are annotated by aggregating models that are learned from three complementary feature sets. Meanwhile, the models are collaboratively adapted by exploiting unlabeled shots. In this sense, the method can be viewed as a novel semi-supervised algorithm that exploits three complementary views. Our approach also exploits the temporal smoothness of video labels by applying a label correction strategy. Experiments on a web video dataset demonstrate the effectiveness of the proposed approach.
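The pipeline outlined in the abstract (three per-view models, collaborative adaptation on unlabeled shots, aggregation, temporal label correction) can be illustrated with a minimal sketch. The abstract does not specify the adaptation rule, the base classifier, or the correction strategy, so everything below is an assumption made for illustration: a nearest-centroid classifier per view, a tri-training-style agreement rule (a shot is pseudo-labeled for one view when the other two views agree), majority-vote aggregation, and a sliding-window majority filter for label correction. All function names are hypothetical, and labels are assumed to be binary (0 or 1).

```python
from collections import Counter

def centroid_fit(X, y):
    """Fit a nearest-centroid model: one mean feature vector per label."""
    sums, counts = {}, Counter(y)
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def centroid_predict(model, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(sorted(model), key=lambda label: sqdist(model[label]))

def tri_adapt(views_labeled, y, views_unlabeled, rounds=1):
    """Assumed adaptation rule: for each view, add the unlabeled shots on
    which the OTHER two views agree, then refit that view's model."""
    models = [centroid_fit(X, y) for X in views_labeled]
    n_unlabeled = len(views_unlabeled[0])
    for _ in range(rounds):
        preds = [[centroid_predict(m, x) for x in U]
                 for m, U in zip(models, views_unlabeled)]
        adapted = []
        for v in range(3):
            a, b = [w for w in range(3) if w != v]
            X_aug, y_aug = list(views_labeled[v]), list(y)
            for i in range(n_unlabeled):
                if preds[a][i] == preds[b][i]:
                    X_aug.append(views_unlabeled[v][i])
                    y_aug.append(preds[a][i])
            adapted.append(centroid_fit(X_aug, y_aug))
        models = adapted
    return models

def aggregate(models, shot_views):
    """Majority vote over the three per-view predictions for one shot."""
    votes = Counter(centroid_predict(m, x) for m, x in zip(models, shot_views))
    return votes.most_common(1)[0][0]

def smooth_labels(labels, window=3):
    """Assumed label correction: replace each shot label with the majority
    label in its temporal window; on a tie, keep the original label."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        counts = Counter(labels[max(0, i - half): i + half + 1])
        top, top_n = counts.most_common(1)[0]
        out.append(labels[i] if counts[labels[i]] == top_n else top)
    return out

# Toy data: one 1-D feature per view, two labeled and two unlabeled shots.
views_labeled = [[[0.9], [0.1]], [[0.8], [0.2]], [[1.0], [0.0]]]
y = [1, 0]
views_unlabeled = [[[0.7], [0.2]], [[0.9], [0.1]], [[0.6], [0.3]]]
models = tri_adapt(views_labeled, y, views_unlabeled)
print(aggregate(models, [[0.8], [0.7], [0.9]]))
print(smooth_labels([1, 1, 0, 1, 1, 0, 0, 0]))
```

In this toy run, the isolated 0 at the third shot is flipped by the window filter because both of its temporal neighbors are labeled 1, which is the kind of correction temporal smoothness motivates. A real system would replace the centroid classifier and the agreement rule with whatever the full paper specifies.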







Acknowledgments
The authors sincerely appreciate the useful comments and suggestions from the anonymous reviewers. This work was supported by the National Natural Science Fund of China (Grant Nos. 61272214, 61173104, 61301222), the China Postdoctoral Science Foundation (Grant No. 2013M541821), and the Fundamental Research Funds for the Central Universities (Grant Nos. 2013HGQC0018, 2013HGBH0027, 2013HGBZ0166).
Cite this article
Sun, F., Xu, M., Li, H. et al. Social video annotation by combining features with a tri-adaptation approach. Multimedia Systems 22, 413–422 (2016). https://doi.org/10.1007/s00530-014-0405-x
DOI: https://doi.org/10.1007/s00530-014-0405-x