DOI: 10.1145/2964284.2964320

Share-and-Chat: Achieving Human-Level Video Commenting by Search and Multi-View Embedding

Published: 01 October 2016

Abstract

Video has become a predominant social medium for booming live interactions. Automatically generating emotional comments on a video has great potential to significantly increase user engagement in many socio-video applications (e.g., chat bots). Nevertheless, the problem of video commenting has been largely overlooked by the research community. The major challenge is that the generated comments must be not only as natural as those written by humans, but also relevant to the video content. In this paper we present a novel two-stage deep learning-based approach to automatic video commenting. Our approach consists of two components. The first, similar video search, efficiently finds videos visually similar to a given video using approximate nearest-neighbor search over learned deep video representations, while the second, dynamic ranking, ranks the comments associated with the retrieved similar videos by learning a deep multi-view embedding space. To model the emotional view of videos, we incorporate visual sentiment, video content, and text comments into the learning of the embedding space. On a newly collected dataset with over 102K videos and 10.6M comments, we demonstrate that our approach outperforms several state-of-the-art methods and achieves human-level video commenting.
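Below is a minimal sketch of the two-stage pipeline the abstract describes, assuming hypothetical pre-computed deep video features, sentiment features, and projection matrices; the index choice, feature dimensions, and fusion step are illustrative placeholders, not the authors' implementation.

```python
# Sketch of retrieve-then-rank video commenting (all data/matrices are hypothetical).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stage 1: similar video search over deep video representations.
# video_db stands in for deep features of the comment-bearing corpus (N x d).
video_db = rng.standard_normal((10_000, 512)).astype(np.float32)
index = NearestNeighbors(n_neighbors=20, metric="cosine").fit(video_db)

query = rng.standard_normal((1, 512)).astype(np.float32)  # query video features
_, nn_ids = index.kneighbors(query)                        # ids of visually similar videos

# Stage 2: dynamic ranking in a shared multi-view embedding space.
# W_video, W_text, W_sent project video, comment, and visual-sentiment features
# into a common k-dim space; here they are random stand-ins for projections
# that would be learned offline.
k = 256
W_video, W_text, W_sent = (rng.standard_normal((512, k)) for _ in range(3))

def embed(x, W):
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def rank_comments(video_feat, sent_feat, comment_feats):
    """Score candidate comments by cosine similarity to the fused
    video + sentiment embedding of the query video."""
    q = embed(video_feat, W_video) + embed(sent_feat, W_sent)
    q /= np.linalg.norm(q, axis=-1, keepdims=True)
    return (embed(comment_feats, W_text) @ q.T).ravel()

# Candidate comments are those attached to the videos retrieved in stage 1
# (hypothetical features here); the highest-scoring ones are surfaced.
candidate_comments = rng.standard_normal((50, 512))
scores = rank_comments(query, rng.standard_normal((1, 512)), candidate_comments)
top5 = np.argsort(-scores)[:5]
```

In the full system the index is built offline so retrieval stays fast at query time, and the multi-view embedding is trained so that the video, sentiment, and comment views of the same clip land close together in the shared space.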



Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep convolutional neural networks
  2. multi-view embedding
  3. video commenting
  4. visual sentiment analysis

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15-19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 2
Reflects downloads up to 23 Feb 2025


Cited By

  • (2023) VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts. Proceedings of the 31st ACM International Conference on Multimedia, pp. 4688-4696. DOI: 10.1145/3581783.3612078. Online publication date: 26-Oct-2023.
  • (2023) A comprehensive survey on deep-learning-based visual captioning. Multimedia Systems, 29(6):3781-3804. DOI: 10.1007/s00530-023-01175-x. Online publication date: 21-Sep-2023.
  • (2020) Vision and language: from visual perception to content creation. APSIPA Transactions on Signal and Information Processing, 9(1). DOI: 10.1017/ATSIP.2020.10. Online publication date: 2020.
  • (2019) Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(3):1-19. DOI: 10.1145/3328994. Online publication date: 8-Aug-2019.
  • (2019) Deep Learning-Based Multimedia Analytics. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(1s):1-26. DOI: 10.1145/3279952. Online publication date: 24-Jan-2019.
  • (2019) Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification. Multimedia Tools and Applications. DOI: 10.1007/s11042-019-08147-2. Online publication date: 17-Dec-2019.
  • (2019) See and chat. Multimedia Tools and Applications, 78(3):2689-2702. DOI: 10.1007/s11042-018-5746-6. Online publication date: 1-Feb-2019.
  • (2019) A Robust Q-Learning and Differential Evolution Based Policy Framework for Key Frame Extraction. Intelligent Computing, Information and Control Systems, pp. 716-728. DOI: 10.1007/978-3-030-30465-2_79. Online publication date: 19-Oct-2019.
  • (2019) Towards Generating Stylized Image Captions via Adversarial Training. PRICAI 2019: Trends in Artificial Intelligence, pp. 270-284. DOI: 10.1007/978-3-030-29908-8_22. Online publication date: 23-Aug-2019.
  • (2018) "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. Computer Vision - ECCV 2018, pp. 527-543. DOI: 10.1007/978-3-030-01249-6_32. Online publication date: 6-Oct-2018.
