
Multi-modal tag localization for mobile video search

  • Special Issue Paper
  • Journal: Multimedia Systems

Abstract

Given the tremendous growth of mobile videos, video tag localization, which localizes the relevant video clips for an associated semantic tag, is becoming increasingly important to users' browsing and searching experience. However, most existing approaches depend to a large degree on carefully selected visual features, which are manually designed by experts and do not take multi-modality into consideration. To exploit the complementarity of different modalities, in this paper we propose a multi-modal tag localization framework that uses deep learning to learn visual, auditory, and semantic features of videos for tag localization. Furthermore, we show that the framework can be applied to two novel mobile video search applications: (1) automatic time-code-level tag generation and (2) query-dependent video thumbnail selection. Extensive experiments on a public dataset show that the proposed approach achieves promising results, with a 7.6% improvement over the state of the art. Finally, a subjective usability evaluation demonstrates that the proposed applications significantly improve the user's mobile video search experience.
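The full method sits behind the paywall, so as a rough sketch of the fusion idea the abstract describes, the Python snippet below scores each video shot against a query tag by projecting per-shot visual, auditory, and semantic features, together with a word2vec-style tag embedding, into a shared space and averaging the per-modality cosine similarities. The projection matrices (W_v, W_a, W_s, W_t), the feature dimensions, and the average-based late fusion are illustrative assumptions, not the authors' published design.

import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize rows (or a single vector) to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def localize_tag(visual, audio, semantic, tag_vec, W_v, W_a, W_s, W_t, top_k=3):
    """Return the indices of the top_k shots most relevant to a tag.

    visual:   (n_shots, d_v) per-shot CNN features        (assumed)
    audio:    (n_shots, d_a) per-shot auditory features   (assumed)
    semantic: (n_shots, d_s) per-shot text features       (assumed)
    tag_vec:  (d_t,) word2vec-style embedding of the tag  (assumed)
    W_*:      learned projections into a shared space     (assumed)
    """
    t = l2_normalize(tag_vec @ W_t)            # tag in the shared space
    scores = np.zeros(visual.shape[0])
    for feats, W in ((visual, W_v), (audio, W_a), (semantic, W_s)):
        shots = l2_normalize(feats @ W)        # shots in the shared space
        scores += shots @ t                    # cosine similarity per modality
    scores /= 3.0                              # late fusion: average the modalities
    return np.argsort(-scores)[:top_k], scores

Under the same assumptions, the two applications in the abstract fall out directly: time-code-level tag generation keeps every shot whose score clears a threshold, and query-dependent thumbnail selection takes a representative frame from the single top-ranked shot (top_k=1).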




Acknowledgments

This work was supported by the 863 Project (2014AA015202), the National Natural Science Foundation of China (61525206, 61173054, 61572472), the Beijing Natural Science Foundation (4152050), the Funds for Creative Research Groups of China under Grant 61421061, and the Co-sponsored Project of the Beijing Committee of Education. This work was also supported by the Beijing Advanced Innovation Center for Imaging Technology.

Author information

Corresponding author: Wu Liu.


Cite this article

Zhang, R., Tang, S., Liu, W. et al. Multi-modal tag localization for mobile video search. Multimedia Systems 23, 713–724 (2017). https://doi.org/10.1007/s00530-016-0506-9
