DOI: 10.1145/3265845.3265851

Sports Video Captioning by Attentive Motion Representation based Hierarchical Recurrent Neural Networks

Published: 19 October 2018

Abstract

Sports video captioning is the task of automatically generating a textual description of a sports event (e.g., a football, basketball, or volleyball game). Although previous works have shown promising performance in producing coarse, general descriptions of a video, it remains challenging to caption a sports video that contains multiple fine-grained player actions and complex group relationships among players. In this paper, we present a novel hierarchical recurrent neural network (RNN) based framework with an attention mechanism for sports video captioning. A motion representation module is proposed to extract individual pose attributes and group-level trajectory cluster information. Moreover, we introduce a new dataset, called the Sports Video Captioning Dataset-Volleyball, for evaluation. We evaluate our proposed model on two public datasets and our new dataset, and the experimental results demonstrate that our method outperforms state-of-the-art methods.
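
The abstract describes the encoder-decoder pattern common to hierarchical RNN captioners. As a rough, non-authoritative orientation only, the PyTorch sketch below shows that generic pattern: a low-level GRU summarizes short chunks of frame features, a high-level GRU encodes the chunk summaries, and a word decoder attends over the high-level states at every step. All module names, dimensions, and the chunking scheme are illustrative assumptions, and the sketch omits the paper's motion representation module (pose attributes and trajectory clusters), so it is not the authors' model.

```python
import torch
import torch.nn as nn


class HierarchicalCaptioner(nn.Module):
    """Minimal sketch of a hierarchical-RNN captioner with attention.

    NOT the authors' exact architecture: feature sizes, the fixed-length
    chunking, and the additive attention form are illustrative assumptions.
    """

    def __init__(self, feat_dim=512, hid=256, vocab=1000, chunk=8):
        super().__init__()
        self.chunk = chunk
        self.low = nn.GRU(feat_dim, hid, batch_first=True)   # frame-level encoder
        self.high = nn.GRU(hid, hid, batch_first=True)       # chunk-level encoder
        self.embed = nn.Embedding(vocab, hid)
        self.dec = nn.GRUCell(hid + hid, hid)                # word decoder
        self.att_w = nn.Linear(hid + hid, hid)               # additive attention
        self.att_v = nn.Linear(hid, 1)
        self.out = nn.Linear(hid, vocab)

    def forward(self, feats, captions):
        # feats: (B, T, feat_dim) frame features, T divisible by self.chunk.
        B, T, D = feats.shape
        chunks = feats.view(B * (T // self.chunk), self.chunk, D)
        _, h_low = self.low(chunks)                  # (1, B*n_chunks, hid)
        h_low = h_low.view(B, T // self.chunk, -1)   # one summary per chunk
        enc, _ = self.high(h_low)                    # (B, n_chunks, hid)

        h = enc.mean(dim=1)                          # init decoder state
        logits = []
        for t in range(captions.size(1)):            # teacher forcing
            q = h.unsqueeze(1).expand_as(enc)
            score = self.att_v(torch.tanh(self.att_w(torch.cat([enc, q], -1))))
            alpha = torch.softmax(score, dim=1)      # (B, n_chunks, 1)
            ctx = (alpha * enc).sum(dim=1)           # attended video context
            x = torch.cat([self.embed(captions[:, t]), ctx], dim=-1)
            h = self.dec(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)            # (B, L, vocab)


# Usage: one teacher-forced pass on random tensors, just to show shapes.
model = HierarchicalCaptioner()
feats = torch.randn(2, 32, 512)                      # 2 videos, 32 frames each
caps = torch.randint(0, 1000, (2, 12))               # target token ids
print(model(feats, caps).shape)                      # torch.Size([2, 12, 1000])
```

Running the snippet prints torch.Size([2, 12, 1000]), i.e. one vocabulary distribution per target word, which a cross-entropy loss would consume during training; the full model would additionally fuse the pose-attribute and group-trajectory features the abstract describes into the attended context.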

          Published In

          MMSports'18: Proceedings of the 1st International Workshop on Multimedia Content Analysis in Sports
          October 2018
          110 pages
          ISBN:9781450359818
          DOI:10.1145/3265845

Publisher

Association for Computing Machinery, New York, NY, United States

          Author Tags

          1. motion representation
          2. pose attribute
          3. sports video captioning
          4. video analysis
          5. volleyball

          Qualifiers

          • Research-article

Conference

MM '18: ACM Multimedia Conference
October 26, 2018
Seoul, Republic of Korea

          Acceptance Rates

MMSports'18 paper acceptance rate: 12 of 23 submissions (52%)
Overall acceptance rate: 29 of 49 submissions (59%)

Cited By

• (2022) Attention-Aware Multiple Granularities Network for Player Re-Identification. Proceedings of the 5th International ACM Workshop on Multimedia Content Analysis in Sports, 137-144. DOI: 10.1145/3552437.3555695
• (2022) AAP-MIT: Attentive Atrous Pyramid Network and Memory Incorporated Transformer for Multisentence Video Description. IEEE Transactions on Image Processing, 31, 5559-5569. DOI: 10.1109/TIP.2022.3195643
• (2021) Att-BiL-SL: Attention-Based Bi-LSTM and Sequential LSTM for Describing Video in the Textual Formation. Applied Sciences, 12(1), 317. DOI: 10.3390/app12010317
• (2021) Latent Memory-augmented Graph Transformer for Visual Storytelling. Proceedings of the 29th ACM International Conference on Multimedia, 4892-4901. DOI: 10.1145/3474085.3475236
• (2020) Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks. Proceedings of the 28th ACM International Conference on Multimedia, 3007-3015. DOI: 10.1145/3394171.3416269
• (2020) Automatic baseball commentary generation using deep learning. Proceedings of the 35th Annual ACM Symposium on Applied Computing, 1056-1065. DOI: 10.1145/3341105.3374063
• (2020) Sports Video Captioning via Attentive Motion Representation and Group Relationship Modeling. IEEE Transactions on Circuits and Systems for Video Technology, 30(8), 2617-2633. DOI: 10.1109/TCSVT.2019.2921655
• (2020) stagNet: An Attentive Semantic RNN for Group Activity and Individual Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 30(2), 549-565. DOI: 10.1109/TCSVT.2019.2894161
