DOI: 10.1145/3265845.3265851

Sports Video Captioning by Attentive Motion Representation based Hierarchical Recurrent Neural Networks

Published: 19 October 2018

Abstract

Sports video captioning is the task of automatically generating a textual description of a sports event (e.g., a football, basketball, or volleyball game). Although previous works have shown promising performance in producing coarse, general descriptions of a video, it remains challenging to caption a sports video that contains multiple fine-grained player actions and complex group relationships among players. In this paper, we present a novel hierarchical recurrent neural network (RNN) based framework with an attention mechanism for sports video captioning. A motion representation module is proposed to extract individual pose attributes and group-level trajectory cluster information. Moreover, we introduce a new dataset, called the Sports Video Captioning Dataset-Volleyball, for evaluation. We evaluate our proposed model on two public datasets and our new dataset, and the experimental results demonstrate that our method outperforms state-of-the-art methods.
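
The abstract describes the encoder-decoder pattern common to hierarchical RNN captioners. As a rough, non-authoritative orientation only, the PyTorch sketch below shows that generic pattern: a low-level GRU summarizes short chunks of frame features, a high-level GRU encodes the chunk summaries, and a word decoder attends over the high-level states at every step. All module names, dimensions, and the chunking scheme are illustrative assumptions, and the sketch omits the paper's motion representation module (pose attributes and trajectory clusters), so it is not the authors' model.

```python
import torch
import torch.nn as nn


class HierarchicalCaptioner(nn.Module):
    """Minimal sketch of a hierarchical-RNN captioner with attention.

    NOT the authors' exact architecture: feature sizes, the fixed-length
    chunking, and the additive attention form are illustrative assumptions.
    """

    def __init__(self, feat_dim=512, hid=256, vocab=1000, chunk=8):
        super().__init__()
        self.chunk = chunk
        self.low = nn.GRU(feat_dim, hid, batch_first=True)   # frame-level encoder
        self.high = nn.GRU(hid, hid, batch_first=True)       # chunk-level encoder
        self.embed = nn.Embedding(vocab, hid)
        self.dec = nn.GRUCell(hid + hid, hid)                # word decoder
        self.att_w = nn.Linear(hid + hid, hid)               # additive attention
        self.att_v = nn.Linear(hid, 1)
        self.out = nn.Linear(hid, vocab)

    def forward(self, feats, captions):
        # feats: (B, T, feat_dim) frame features, T divisible by self.chunk.
        B, T, D = feats.shape
        chunks = feats.view(B * (T // self.chunk), self.chunk, D)
        _, h_low = self.low(chunks)                  # (1, B*n_chunks, hid)
        h_low = h_low.view(B, T // self.chunk, -1)   # one summary per chunk
        enc, _ = self.high(h_low)                    # (B, n_chunks, hid)

        h = enc.mean(dim=1)                          # init decoder state
        logits = []
        for t in range(captions.size(1)):            # teacher forcing
            q = h.unsqueeze(1).expand_as(enc)
            score = self.att_v(torch.tanh(self.att_w(torch.cat([enc, q], -1))))
            alpha = torch.softmax(score, dim=1)      # (B, n_chunks, 1)
            ctx = (alpha * enc).sum(dim=1)           # attended video context
            x = torch.cat([self.embed(captions[:, t]), ctx], dim=-1)
            h = self.dec(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, L, vocab)


# Usage: one teacher-forced pass on random tensors, just to show shapes.
model = HierarchicalCaptioner()
feats = torch.randn(2, 32, 512)                      # 2 videos, 32 frames each
caps = torch.randint(0, 1000, (2, 12))               # target token ids
print(model(feats, caps).shape)                      # torch.Size([2, 12, 1000])
```

Running the snippet prints torch.Size([2, 12, 1000]), i.e. one vocabulary distribution per target word, which a cross-entropy loss would consume during training; the full model would additionally fuse the pose-attribute and group-trajectory features the abstract describes into the attended context.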

          Published In

          MMSports'18: Proceedings of the 1st International Workshop on Multimedia Content Analysis in Sports
          October 2018
          110 pages
          ISBN:9781450359818
          DOI:10.1145/3265845

Publisher

Association for Computing Machinery, New York, NY, United States

          Author Tags

          1. motion representation
          2. pose attribute
          3. sports video captioning
          4. video analysis
          5. volleyball

          Qualifiers

          • Research-article

Conference

MM '18: ACM Multimedia Conference
October 26, 2018
Seoul, Republic of Korea

          Acceptance Rates

MMSports'18 paper acceptance rate: 12 of 23 submissions (52%)
Overall acceptance rate: 29 of 49 submissions (59%)

Cited By

• (2022) Attention-Aware Multiple Granularities Network for Player Re-Identification. Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 137-144. DOI: 10.1145/3552437.3555695
• (2022) AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing, 31, 5559-5569. DOI: 10.1109/TIP.2022.3195643
• (2021) Att-BiL-SL: Attention-Based Bi-LSTM and Sequential LSTM for Describing Video in the Textual Formation. Applied Sciences, 12(1), 317. DOI: 10.3390/app12010317
• (2021) Latent Memory-augmented Graph Transformer for Visual Storytelling. Proceedings of the 29th ACM International Conference on Multimedia, 4892-4901. DOI: 10.1145/3474085.3475236
• (2020) Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks. Proceedings of the 28th ACM International Conference on Multimedia, 3007-3015. DOI: 10.1145/3394171.3416269
• (2020) Automatic baseball commentary generation using deep learning. Proceedings of the 35th Annual ACM Symposium on Applied Computing, 1056-1065. DOI: 10.1145/3341105.3374063
• (2020) Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling. IEEE Transactions on Circuits and Systems for Video Technology, 30(8), 2617-2633. DOI: 10.1109/TCSVT.2019.2921655
• (2020) stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 30(2), 549-565. DOI: 10.1109/TCSVT.2019.2894161
