Deep feature extraction and motion representation for satellite video scene classification

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Satellite video scene classification (SVSC), an emerging topic in the remote sensing field, refers to determining the scene category of a satellite video. SVSC is an important and fundamental step in satellite video analysis and understanding, as it provides priors on the presence of objects and dynamic events. In this paper, a two-stage framework is proposed that extracts spatial features and motion features for SVSC. The first stage extracts spatial features from satellite videos: representative frames are first selected based on blur detection and the spatial activity of the video, and a fine-tuned visual geometry group network (VGG-Net) is then transferred to extract spatial features from their content. The second stage builds a motion representation for satellite videos. First, the motion representation of moving targets is constructed from the second temporal principal component of principal component analysis (PCA). Second, features from the first fully connected layer of VGG-Net are used as a high-level spatial representation of the moving targets. Third, a small long short-term memory (LSTM) network is designed to encode the temporal information. The two-stage features characterize the spatial and temporal patterns of satellite scenes, respectively, and are finally fused for SVSC. A satellite video dataset is built for video scene classification, comprising 7209 video segments that cover 8 scene categories; the videos are from the Jilin-1 satellites and Urthecast. The experimental results demonstrate the effectiveness of the proposed framework for SVSC.
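To make the two-stage pipeline concrete, the sketch below illustrates its main ingredients in Python; it is not the authors' implementation, and every function name, threshold, and layer choice is an assumption. It presumes OpenCV, NumPy, PyTorch, and a recent torchvision (>= 0.13): a variance-of-Laplacian sharpness score and a frame-difference activity score stand in for the blur-detection and spatial-activity criteria for selecting representative frames, VGG-16 fc1 features serve as the spatial representation, the spatial map associated with the second temporal principal component supplies a motion representation, and a small LSTM encodes the temporal sequence before fusion with the spatial feature.

```python
# Hedged sketch of the two-stage SVSC pipeline; names and choices are illustrative.
import cv2
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------- Stage 1: representative frames + VGG spatial features ----------
def frame_scores(frames):
    """Score frames by sharpness and spatial activity.
    Variance of the Laplacian stands in for a no-reference blur metric;
    mean absolute frame difference stands in for spatial activity."""
    scores, prev = [], None
    for f in frames:                                   # frames: list of HxWx3 uint8
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32)
        sharp = cv2.Laplacian(gray, cv2.CV_64F).var()
        act = 0.0 if prev is None else float(np.abs(gray - prev).mean())
        scores.append(sharp + act)
        prev = gray
    return np.asarray(scores)                          # pick frames with top scores

vgg = models.vgg16(weights="IMAGENET1K_V1").to(device).eval()
prep = T.Compose([T.ToTensor(), T.Resize((224, 224)),
                  T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def vgg_fc1(rgb_img):
    """4096-d feature from the first fully connected layer of VGG-16."""
    x = prep(rgb_img).unsqueeze(0).to(device)
    x = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
    return vgg.classifier[0](x)                        # first FC layer only

# ---------- Stage 2: PCA motion map + LSTM temporal encoding ----------
def second_temporal_pc_map(frames):
    """Spatial map associated with the second temporal principal component."""
    gray = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]).astype(np.float32)
    t, h, w = gray.shape
    X = gray.reshape(t, h * w)
    X -= X.mean(axis=0, keepdims=True)                 # remove the static background
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[1].reshape(h, w)                         # component 2 highlights motion

class TemporalEncoder(nn.Module):
    """Small LSTM over a sequence of per-window motion features, fused with
    a video-level spatial feature for classification into 8 scene categories."""
    def __init__(self, hidden=128, n_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden + 4096, n_classes)

    def forward(self, motion_seq, spatial_feat):       # (B, T, 4096), (B, 4096)
        _, (h_n, _) = self.lstm(motion_seq)
        fused = torch.cat([h_n[-1], spatial_feat], dim=1)
        return self.fc(fused)
```

In a fuller version, the motion maps would be computed over successive temporal windows, replicated to three channels, and passed through vgg_fc1 to form the (B, T, 4096) sequence consumed by TemporalEncoder, while spatial_feat could be the average fc1 feature of the selected representative frames; fusion by concatenation is likewise only one plausible reading of the fusion step.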

Acknowledgements

This work was supported by the Key International Cooperation Project of the National Natural Science Foundation of China (Grant No. 61720106002), the Key Research and Development Project of the Ministry of Science and Technology (Grant No. 2017YFC1405100), the National Natural Science Foundation of China (Grant No. 61901141), and the Fundamental Research Funds for the Central Universities (Grant No. HIT.HSRIF.2020010). The authors would like to thank the IEEE GRSS Image Analysis and Data Fusion Technical Committee for providing the Urthecast satellite videos.

Author information

Corresponding author

Correspondence to Guoming Gao.

About this article

Cite this article

Gu, Y., Liu, H., Wang, T. et al. Deep feature extraction and motion representation for satellite video scene classification. Sci. China Inf. Sci. 63, 140307 (2020). https://doi.org/10.1007/s11432-019-2784-4
