DOI: 10.1145/2964284.2964320

Share-and-Chat: Achieving Human-Level Video Commenting by Search and Multi-View Embedding

Published: 01 October 2016

Abstract

Video has become a predominant social medium for booming live interactions. Automatically generating emotional comments on a video has great potential to significantly increase user engagement in many socio-video applications (e.g., chat bots). Nevertheless, the problem of video commenting has been largely overlooked by the research community. The major challenge is that the generated comments must be not only as natural as those written by humans, but also relevant to the video content. In this paper we present a novel two-stage deep learning-based approach to automatic video commenting. Our approach consists of two components. The first, similar video search, efficiently finds videos visually similar to a given video using approximate nearest-neighbor search over learned deep video representations, while the second, dynamic ranking, ranks the comments associated with the retrieved similar videos by learning a deep multi-view embedding space. To model the emotional view of videos, we incorporate visual sentiment, video content, and text comments into the learning of the embedding space. On a newly collected dataset with over 102K videos and 10.6M comments, we demonstrate that our approach outperforms several state-of-the-art methods and achieves human-level video commenting.
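Below is a minimal sketch of the two-stage pipeline the abstract describes, assuming hypothetical pre-computed deep video features, sentiment features, and projection matrices; the index choice, feature dimensions, and fusion step are illustrative placeholders, not the authors' implementation.

```python
# Sketch of retrieve-then-rank video commenting (all data/matrices are hypothetical).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Stage 1: similar video search over deep video representations.
# video_db stands in for deep features of the comment-bearing corpus (N x d).
video_db = rng.standard_normal((10_000, 512)).astype(np.float32)
index = NearestNeighbors(n_neighbors=20, metric="cosine").fit(video_db)

query = rng.standard_normal((1, 512)).astype(np.float32)  # query video features
_, nn_ids = index.kneighbors(query)                        # ids of visually similar videos

# Stage 2: dynamic ranking in a shared multi-view embedding space.
# W_video, W_text, W_sent project video, comment, and visual-sentiment features
# into a common k-dim space; here they are random stand-ins for projections
# that would be learned offline.
k = 256
W_video, W_text, W_sent = (rng.standard_normal((512, k)) for _ in range(3))

def embed(x, W):
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def rank_comments(video_feat, sent_feat, comment_feats):
    """Score candidate comments by cosine similarity to the fused
    video + sentiment embedding of the query video."""
    q = embed(video_feat, W_video) + embed(sent_feat, W_sent)
    q /= np.linalg.norm(q, axis=-1, keepdims=True)
    return (embed(comment_feats, W_text) @ q.T).ravel()

# Candidate comments are those attached to the videos retrieved in stage 1
# (hypothetical features here); the highest-scoring ones are surfaced.
candidate_comments = rng.standard_normal((50, 512))
scores = rank_comments(query, rng.standard_normal((1, 512)), candidate_comments)
top5 = np.argsort(-scores)[:5]
```

In the full system the index is built offline so retrieval stays fast at query time, and the multi-view embedding is trained so that the video, sentiment, and comment views of the same clip land close together in the shared space.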



Published In

MM '16: Proceedings of the 24th ACM international conference on Multimedia
October 2016
1542 pages
ISBN: 9781450336031
DOI: 10.1145/2964284

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. deep convolutional neural networks
  2. multi-view embedding
  3. video commenting
  4. visual sentiment analysis

Qualifiers

  • Research-article

Conference

MM '16: ACM Multimedia Conference
October 15-19, 2016
Amsterdam, The Netherlands

Acceptance Rates

MM '16 paper acceptance rate: 52 of 237 submissions (22%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 2
Reflects downloads up to 23 Feb 2025


Cited By

  • (2023) VCMaster: Generating Diverse and Fluent Live Video Comments Based on Multimodal Contexts. Proceedings of the 31st ACM International Conference on Multimedia, pp. 4688-4696. DOI: 10.1145/3581783.3612078. Online publication date: 26-Oct-2023.
  • (2023) A comprehensive survey on deep-learning-based visual captioning. Multimedia Systems, 29(6):3781-3804. DOI: 10.1007/s00530-023-01175-x. Online publication date: 21-Sep-2023.
  • (2020) Vision and language: from visual perception to content creation. APSIPA Transactions on Signal and Information Processing, 9(1). DOI: 10.1017/ATSIP.2020.10. Online publication date: 2020.
  • (2019) Learning Click-Based Deep Structure-Preserving Embeddings with Visual Attention. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(3):1-19. DOI: 10.1145/3328994. Online publication date: 8-Aug-2019.
  • (2019) Deep Learning-Based Multimedia Analytics. ACM Transactions on Multimedia Computing, Communications, and Applications, 15(1s):1-26. DOI: 10.1145/3279952. Online publication date: 24-Jan-2019.
  • (2019) Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification. Multimedia Tools and Applications. DOI: 10.1007/s11042-019-08147-2. Online publication date: 17-Dec-2019.
  • (2019) See and chat. Multimedia Tools and Applications, 78(3):2689-2702. DOI: 10.1007/s11042-018-5746-6. Online publication date: 1-Feb-2019.
  • (2019) A Robust Q-Learning and Differential Evolution Based Policy Framework for Key Frame Extraction. Intelligent Computing, Information and Control Systems, pp. 716-728. DOI: 10.1007/978-3-030-30465-2_79. Online publication date: 19-Oct-2019.
  • (2019) Towards Generating Stylized Image Captions via Adversarial Training. PRICAI 2019: Trends in Artificial Intelligence, pp. 270-284. DOI: 10.1007/978-3-030-29908-8_22. Online publication date: 23-Aug-2019.
  • (2018) "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. Computer Vision - ECCV 2018, pp. 527-543. DOI: 10.1007/978-3-030-01249-6_32. Online publication date: 6-Oct-2018.
