ABSTRACT
Social video sharing websites allow users to annotate videos with descriptive keywords called tags, which greatly facilitate video search and browsing. However, many tags describe only part of the video content and carry no temporal indication of when the tagged content actually appears. There is currently very little research on automatically assigning tags to shot-level segments of a video. In this paper, we use users' tags as a source for analyzing the content within a video and develop a novel system named ShotTagger that assigns tags at the shot level. Tags are located at the shot level in two steps. The first step estimates the distribution of tags within the video, based on a multiple instance learning framework. The second step refines the shot-level tagging results within an optimization framework that exploits the semantic correlation of a tag with the other tags of the video and imposes temporal smoothness across adjacent video shots. We present several applications that demonstrate the usefulness of the tag location scheme for searching and browsing videos. A series of experiments conducted on a set of YouTube videos demonstrates the feasibility and effectiveness of our approach.
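The two steps can be illustrated with a minimal Python sketch. It is not the ShotTagger formulation, which the abstract does not spell out. It assumes that per-shot tag probabilities are already available, uses the standard noisy-OR rule from multiple instance learning to relate shot-level and video-level labels, and replaces the paper's optimization with a simple neighbour-averaging update on the chain of shots. The semantic correlation between tags is omitted. The function names and the parameters `alpha` and `iterations` are hypothetical.

```python
from math import prod


def video_level_probability(shot_probs):
    """Noisy-OR rule: a video (bag) carries a tag if at least one shot (instance) does."""
    return 1.0 - prod(1.0 - p for p in shot_probs)


def smooth_over_time(initial, alpha=0.5, iterations=50):
    """Pull each shot's score toward the mean of its temporal neighbours.

    `initial` holds one score per shot, in temporal order. `alpha` (assumed)
    balances the neighbour mean against the shot's own initial estimate.
    """
    scores = list(initial)
    for _ in range(iterations):
        updated = []
        for i, anchor in enumerate(initial):
            # Adjacent shots on either side, where they exist.
            neighbours = [scores[j] for j in (i - 1, i + 1) if 0 <= j < len(scores)]
            mean = sum(neighbours) / len(neighbours) if neighbours else anchor
            updated.append((1 - alpha) * anchor + alpha * mean)
        scores = updated
    return scores


if __name__ == "__main__":
    shot_scores = [0.1, 0.9, 0.1]  # made-up scores for one tag over three shots
    print(video_level_probability(shot_scores))
    print(smooth_over_time(shot_scores))
```

In this sketch an isolated high score is damped and its neighbours are raised, which is the qualitative effect a temporal smoothness term is meant to have. A larger `alpha` smooths more aggressively.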
Index Terms
- ShotTagger: tag location for internet videos