
Tag refinement of micro-videos by learning from multiple data sources

Multimedia Tools and Applications

Abstract

Micro-video is an increasingly prevalent form of social media that attracts much attention because it is easy to capture and highly expressive. However, because the user-generated hashtags of micro-videos are of low quality and have a severely unbalanced distribution, managing micro-videos is challenging. In this paper, we propose a novel tag refinement approach for micro-videos that learns from multiple public data sources with manually labelled tags. This avoids the difficulty of directly refining the imprecise hashtags and addresses the lack of manually labelled micro-video datasets for training. We define a set of target tags by referring to widely used datasets for object, activity and scene detection. In tag refinement, we first transfer tags from the images in NUS-WIDE to the micro-video keyframes by similarity measurement. Meanwhile, we complement the tags by detecting the objects, activities and scenes in micro-videos based on appearance and motion features, with the assistance of the ImageNet, PASCAL VOC, HMDB51, UCF50 and SUN datasets. We also denoise the hashtags by constructing mapping relationships between hashtags and target tags based on statistics from NUS-WIDE. The results of tag transfer, complement and denoising are finally linearly combined to generate the tag refinement results for micro-videos. To validate the performance, we construct a dataset of 600 micro-videos from Vine and manually label the micro-videos with target tags. The experimental results show that our approach obtains good performance in tag refinement of micro-videos by learning from multiple data sources.
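The final step described in the abstract, linearly combining the results of tag transfer, complement and denoising, can be illustrated with a minimal sketch. The function names, the per-tag score dictionaries and the weight values below are assumptions made for illustration only; the abstract does not specify how the scores are represented or how the combination weights are chosen.

```python
from typing import Dict, List, Tuple


def combine_scores(
    transfer: Dict[str, float],
    complement: Dict[str, float],
    denoise: Dict[str, float],
    weights: Tuple[float, float, float] = (0.4, 0.4, 0.2),  # hypothetical weights
) -> Dict[str, float]:
    """Linearly combine three per-tag score dictionaries.

    A tag missing from one source contributes a score of 0.0 for that source.
    """
    w_t, w_c, w_d = weights
    tags = set(transfer) | set(complement) | set(denoise)
    return {
        tag: w_t * transfer.get(tag, 0.0)
        + w_c * complement.get(tag, 0.0)
        + w_d * denoise.get(tag, 0.0)
        for tag in tags
    }


def top_tags(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k highest-scoring tags, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:k]]
```

For example, given hypothetical scores `{"dog": 0.8, "grass": 0.5}` from transfer, `{"dog": 0.9, "running": 0.6}` from complement and `{"dog": 1.0, "beach": 0.2}` from denoising, `top_tags(combine_scores(...), k=2)` would rank `dog` first and `running` second under the illustrative weights above.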



Acknowledgements

The authors would like to thank the anonymous reviewers for their helpful suggestions. This work is supported by the National Science Foundation of China (61202320) and the Research Project of Excellent State Key Laboratory (61223003).

Corresponding author

Correspondence to Bin Luo.

Cite this article

Huang, L., Luo, B. Tag refinement of micro-videos by learning from multiple data sources. Multimed Tools Appl 76, 20341–20358 (2017). https://doi.org/10.1007/s11042-017-4781-z


  • DOI: https://doi.org/10.1007/s11042-017-4781-z
