Abstract
Micro-video is an increasingly prevalent form of social media that attracts much attention because it is easy to capture and highly expressive. However, because the user-generated hashtags of micro-videos are severely imbalanced in distribution and low in quality, managing micro-videos is challenging. In this paper, we propose a novel tag refinement approach for micro-videos that learns from multiple public data sources with manually labelled tags. This avoids the difficulty of directly refining the imprecise hashtags and addresses the lack of manually labelled micro-video datasets for training. We define a set of target tags by referring to widely used datasets for object, activity and scene detection. In tag refinement, we first transfer tags from the images in NUS-WIDE to the micro-video keyframes by similarity measurement. Meanwhile, we complement the tags by detecting the objects, activities and scenes in micro-videos based on appearance and motion features, with the assistance of the datasets ImageNet, PASCAL VOC, HMDB51, UCF50 and SUN. We also denoise the hashtags by constructing mapping relationships between hashtags and target tags based on statistics from NUS-WIDE. The results of tag transfer, complement and denoising are finally linearly combined to generate the tag refinement results for micro-videos. To validate the performance, we construct a dataset of 600 micro-videos from Vine and manually label the micro-videos with target tags. The experimental results show that, by learning from multiple data sources, our approach performs well in tag refinement of micro-videos.
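The final linear combination step can be illustrated with a minimal sketch. It assumes that each of the three stages (tag transfer, complement and denoising) produces a score per target tag. The function names, the example scores and the weights below are all hypothetical, chosen only for illustration; the abstract does not state the actual weights or score ranges.

```python
from typing import Dict, List, Sequence


def fuse_tag_scores(
    score_sets: Sequence[Dict[str, float]],
    weights: Sequence[float],
) -> Dict[str, float]:
    """Weighted sum of per-tag scores; a tag absent from a source contributes 0."""
    if len(score_sets) != len(weights):
        raise ValueError("need exactly one weight per score set")
    fused: Dict[str, float] = {}
    for scores, weight in zip(score_sets, weights):
        for tag, value in scores.items():
            fused[tag] = fused.get(tag, 0.0) + weight * value
    return fused


def top_tags(fused: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring tags, breaking ties alphabetically."""
    return sorted(fused, key=lambda tag: (-fused[tag], tag))[:k]


# Hypothetical per-stage scores for one micro-video (not from the paper).
transfer = {"dog": 0.8, "grass": 0.6}      # tags transferred from similar images
complement = {"dog": 0.9, "running": 0.2}  # detector outputs
denoise = {"dog": 0.5, "grass": 0.2}       # hashtags mapped to target tags

# Hypothetical weights; in practice they would be tuned on labelled data.
fused = fuse_tag_scores([transfer, complement, denoise], [0.4, 0.4, 0.2])
refined = top_tags(fused, 2)
```

Tags supported by several stages accumulate a higher combined score, so a tag such as "dog" above outranks tags proposed by only one stage.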
Acknowledgements
The authors would like to thank the anonymous reviewers for their helpful suggestions. This work is supported by the National Science Foundation of China (61202320) and the Research Project of Excellent State Key Laboratory (61223003).
Cite this article
Huang, L., Luo, B. Tag refinement of micro-videos by learning from multiple data sources. Multimed Tools Appl 76, 20341–20358 (2017). https://doi.org/10.1007/s11042-017-4781-z
DOI: https://doi.org/10.1007/s11042-017-4781-z