Abstract
Large-scale surveillance videos often contain complex visual events, so generating video descriptions effectively and efficiently without human supervision has become essential. To address this problem, we propose a novel architecture, motivated by sequence-to-sequence networks, for jointly recognizing multiple events in a given surveillance video. The proposed architecture predicts what happens in a video directly, without the preprocessing steps of object detection and tracking. We evaluate several variants of the proposed architecture with different visual features on a novel dataset prepared by our group, and we compute a wide range of quantitative metrics to evaluate the architecture. We further compare it to the popular Support Vector Machine-based visual event detection method. The comparison results suggest that the proposed method can outperform traditional computer vision pipelines for visual event detection.
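The core idea, an LSTM encoder that summarizes a sequence of per-frame visual features and a multi-label head that scores every candidate event with an independent sigmoid, can be sketched in plain Python. This is an illustrative sketch only: the tiny LSTM, the dimensions, the 0.5 threshold, and the random weights are stand-ins for the trained model and features described in the paper, not the authors' actual implementation.

```python
import math
import random

random.seed(0)

FEAT_DIM, HIDDEN, NUM_EVENTS = 4, 3, 5  # hypothetical sizes for illustration

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rand_mat(rows, cols):
    # random weights standing in for trained parameters
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

class TinyLSTM:
    """Minimal LSTM: one weight matrix per gate over the concatenated [x; h]."""
    def __init__(self, in_dim, hid):
        self.hid = hid
        self.W = {g: rand_mat(hid, in_dim + hid) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation: [x; h]
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]  # input gate
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]  # forget gate
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]  # output gate
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]  # candidate cell
        c = [f_ * c_ + i_ * g_ for f_, c_, i_, g_ in zip(f, c, i, g)]
        h = [o_ * math.tanh(c_) for o_, c_ in zip(o, c)]
        return h, c

def encode(frames, lstm):
    # run the LSTM over the frame features; the final hidden state summarizes the clip
    h, c = [0.0] * lstm.hid, [0.0] * lstm.hid
    for x in frames:
        h, c = lstm.step(x, h, c)
    return h

# multi-label head: one independent sigmoid score per event, thresholded,
# so several events can fire for the same clip (unlike a softmax classifier)
W_out = rand_mat(NUM_EVENTS, HIDDEN)

def detect_events(frames, lstm, thresh=0.5):
    scores = [sigmoid(s) for s in matvec(W_out, encode(frames, lstm))]
    return [k for k, p in enumerate(scores) if p >= thresh], scores

# a clip of 8 frames of (hypothetical) visual features
frames = [[random.uniform(-1, 1) for _ in range(FEAT_DIM)] for _ in range(8)]
lstm = TinyLSTM(FEAT_DIM, HIDDEN)
events, scores = detect_events(frames, lstm)
```

The independent-sigmoid head is what makes the detection multi-label: each event score is thresholded on its own, so the model can report several co-occurring events per clip directly, without a detection-and-tracking front end.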
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (61772359, 61472275, 61572356), the Tianjin Research Program of Application Foundation and Advanced Technology (15JCYBJC16200), and the National Research Foundation, Prime Minister's Office, Singapore, under its International Research Centre in Singapore Funding Initiative.
Cite this article
Liu, AA., Shao, Z., Wong, Y. et al. LSTM-based multi-label video event detection. Multimed Tools Appl 78, 677–695 (2019). https://doi.org/10.1007/s11042-017-5532-x