Abstract
Large-scale surveillance videos often contain complex visual events, so generating video descriptions effectively and efficiently without human supervision has become essential. To address this problem, we propose a novel architecture, motivated by sequence-to-sequence networks, for jointly recognizing multiple events in a given surveillance video. The proposed architecture predicts what happens in a video directly, without the preprocessing steps of object detection and tracking. We evaluate several variants of the proposed architecture with different visual features on a novel dataset prepared by our group, and we compute a wide range of quantitative metrics to evaluate the architecture. We further compare it to the popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposed method can outperform traditional computer vision pipelines for visual event detection.
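The core idea, an LSTM encoder that summarizes a sequence of per-frame visual features and a multi-label head that scores every candidate event with an independent sigmoid, can be sketched in plain Python. This is an illustrative sketch only: the tiny LSTM, the dimensions, the 0.5 threshold, and the random weights are stand-ins for the trained model and features described in the paper, not the authors' actual implementation.

```python
import math
import random

random.seed(0)

FEAT_DIM, HIDDEN, NUM_EVENTS = 4, 3, 5  # hypothetical sizes for illustration

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_mat(rows, cols):
    # random weights standing in for trained parameters
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

class TinyLSTM:
    """Minimal LSTM: one weight matrix per gate over the concatenated [x; h]."""
    def __init__(self, in_dim, hid):
        self.hid = hid
        self.W = {g: rand_mat(hid, in_dim + hid) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation: [x; h]
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]  # input gate
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]  # forget gate
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]  # output gate
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]  # candidate cell
        c = [f_ * c_ + i_ * g_ for f_, c_, i_, g_ in zip(f, c, i, g)]
        h = [o_ * math.tanh(c_) for o_, c_ in zip(o, c)]
        return h, c

def encode(frames, lstm):
    # run the LSTM over the frame features; the final hidden state summarizes the clip
    h, c = [0.0] * lstm.hid, [0.0] * lstm.hid
    for x in frames:
        h, c = lstm.step(x, h, c)
    return h

# multi-label head: one independent sigmoid score per event, thresholded,
# so several events can fire for the same clip (unlike a softmax classifier)
W_out = rand_mat(NUM_EVENTS, HIDDEN)

def detect_events(frames, lstm, thresh=0.5):
    scores = [sigmoid(s) for s in matvec(W_out, encode(frames, lstm))]
    return [k for k, p in enumerate(scores) if p >= thresh], scores

# a clip of 8 frames of (hypothetical) visual features
frames = [[random.uniform(-1, 1) for _ in range(FEAT_DIM)] for _ in range(8)]
lstm = TinyLSTM(FEAT_DIM, HIDDEN)
events, scores = detect_events(frames, lstm)
```

The independent-sigmoid head is what makes the detection multi-label: each event score is thresholded on its own, so the model can report several co-occurring events per clip directly, without a detection-and-tracking front end.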
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (61772359, 61472275, 61572356), the Tianjin Research Program of Application Foundation and Advanced Technology (15JCYBJC16200), and the National Research Foundation, Prime Minister's Office, Singapore, under its International Research Centre in Singapore Funding Initiative.
Cite this article
Liu, AA., Shao, Z., Wong, Y. et al. LSTM-based multi-label video event detection. Multimed Tools Appl 78, 677–695 (2019). https://doi.org/10.1007/s11042-017-5532-x