Abstract
This paper presents a novel method to represent hours of surveillance video as a pattern-based text log. We present a tag- and template-based technique that automatically generates natural language descriptions of surveillance events. We combine the outputs of existing object trackers, deep-learning-based object and action classifiers, and graph-based scene knowledge to assign hierarchical tags and generate natural language descriptions of surveillance events. Unlike state-of-the-art image and short-video descriptor methods, our approach can describe longer videos, specifically surveillance videos, by combining frame-level, temporal-level, and behavior-level target tags/features. We evaluate our method against two baseline video descriptors, and our analysis suggests that supervised scene knowledge and templates can improve video descriptions, especially for surveillance videos.
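The tag-and-template idea in the abstract can be sketched as follows: per-target tags at the frame, temporal, and behavior levels are slotted into a sentence template to produce one log line per event. This is a minimal illustrative sketch, not the authors' implementation; the class, template, and tag names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TargetTags:
    frame_level: str     # e.g. object class from a detector ("person")
    temporal_level: str  # e.g. action from an action classifier ("walking")
    behavior_level: str  # e.g. scene zone from graph-based scene knowledge

# One hypothetical sentence template; a real system would pick a
# template based on the event type.
TEMPLATE = "A {obj} is {action} near the {zone}."

def describe(tags: TargetTags) -> str:
    """Fill the template with the target's hierarchical tags."""
    return TEMPLATE.format(obj=tags.frame_level,
                           action=tags.temporal_level,
                           zone=tags.behavior_level)

print(describe(TargetTags("person", "walking", "entrance")))
# A person is walking near the entrance.
```

Because the output is template-driven rather than free-form, hours of footage compress into a searchable, uniform text log, which is the property the paper evaluates against free-form baseline descriptors.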
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Ahmed, S.A., Dogra, D.P., Kar, S., Roy, P.P. (2019). Natural Language Description of Surveillance Events. In: Chandra, P., Giri, D., Li, F., Kar, S., Jana, D. (eds) Information Technology and Applied Mathematics. Advances in Intelligent Systems and Computing, vol 699. Springer, Singapore. https://doi.org/10.1007/978-981-10-7590-2_10
Print ISBN: 978-981-10-7589-6
Online ISBN: 978-981-10-7590-2