
Multi-modal tag localization for mobile video search

  • Special Issue Paper
  • Journal: Multimedia Systems

Abstract

Given the tremendous growth of mobile videos, video tag localization, which localizes the relevant video clips for an associated semantic tag, is becoming increasingly important to users' browsing and searching experience. However, most existing approaches depend to a large degree on carefully selected visual features, which are manually designed by experts and do not take multi-modality into consideration. To exploit the complementarity of different modalities, in this paper we propose a multi-modal tag localization framework that uses deep learning to learn visual, auditory, and semantic features of videos for tag localization. Furthermore, we show that the framework can be applied to two novel mobile video search applications: (1) automatic time-code-level tag generation and (2) query-dependent video thumbnail selection. Extensive experiments on a public dataset show that the proposed approach achieves promising results, with a 7.6% improvement over the state of the art. Finally, a subjective usability evaluation demonstrates that the proposed applications significantly improve the user's mobile video search experience.
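The full method sits behind the paywall, so as a rough sketch of the fusion idea the abstract describes, the Python snippet below scores each video shot against a query tag by projecting per-shot visual, auditory, and semantic features, together with a word2vec-style tag embedding, into a shared space and averaging the per-modality cosine similarities. The projection matrices (W_v, W_a, W_s, W_t), the feature dimensions, and the average-based late fusion are illustrative assumptions, not the authors' published design.

import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize rows (or a single vector) to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def localize_tag(visual, audio, semantic, tag_vec, W_v, W_a, W_s, W_t, top_k=3):
    """Return the indices of the top_k shots most relevant to a tag.

    visual:   (n_shots, d_v) per-shot CNN features        (assumed)
    audio:    (n_shots, d_a) per-shot auditory features   (assumed)
    semantic: (n_shots, d_s) per-shot text features       (assumed)
    tag_vec:  (d_t,) word2vec-style embedding of the tag  (assumed)
    W_*:      learned projections into a shared space     (assumed)
    """
    t = l2_normalize(tag_vec @ W_t)            # tag in the shared space
    scores = np.zeros(visual.shape[0])
    for feats, W in ((visual, W_v), (audio, W_a), (semantic, W_s)):
        shots = l2_normalize(feats @ W)        # shots in the shared space
        scores += shots @ t                    # cosine similarity per modality
    scores /= 3.0                              # late fusion: average the modalities
    return np.argsort(-scores)[:top_k], scores

Under the same assumptions, the two applications in the abstract fall out directly: time-code-level tag generation keeps every shot whose score clears a threshold, and query-dependent thumbnail selection takes a representative frame from the single top-ranked shot (top_k=1).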




Acknowledgments

This work was supported by the 863 Project (2014AA015202), the National Natural Science Foundation of China (61525206, 61173054, 61572472), the Beijing Natural Science Foundation (4152050), the Funds for Creative Research Groups of China under Grant 61421061, and the Co-sponsored Project of the Beijing Committee of Education. This work was also supported by the Beijing Advanced Innovation Center for Imaging Technology.

Author information

Corresponding author: Wu Liu.


Cite this article

Zhang, R., Tang, S., Liu, W. et al. Multi-modal tag localization for mobile video search. Multimedia Systems 23, 713–724 (2017). https://doi.org/10.1007/s00530-016-0506-9
