Deep feature extraction and motion representation for satellite video scene classification

  • Research Paper
  • Published in: Science China Information Sciences

Abstract

Satellite video scene classification (SVSC), an emerging topic in the remote sensing field, refers to determining the scene category of a satellite video. SVSC is an important and fundamental step in satellite video analysis and understanding, as it provides priors on the presence of objects and dynamic events. In this paper, a two-stage framework is proposed that extracts spatial features and motion features for SVSC. The first stage extracts spatial features from satellite videos: representative frames are first selected based on blur detection and the spatial activity of the video, and a fine-tuned visual geometry group network (VGG-Net) is then transferred to extract spatial features from their content. The second stage builds a motion representation for satellite videos. First, the motion representation of moving targets is constructed from the second temporal principal component of principal component analysis (PCA). Second, features from the first fully connected layer of VGG-Net are used as a high-level spatial representation of the moving targets. Third, a small long short-term memory (LSTM) network is designed to encode the temporal information. The two-stage features characterize the spatial and temporal patterns of satellite scenes, respectively, and are finally fused for SVSC. A satellite video dataset is built for video scene classification, comprising 7209 video segments that cover 8 scene categories; the videos are from the Jilin-1 satellites and Urthecast. The experimental results demonstrate the effectiveness of the proposed framework for SVSC.
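To make the two-stage pipeline concrete, the sketch below illustrates its main ingredients in Python; it is not the authors' implementation, and every function name, threshold, and layer choice is an assumption. It presumes OpenCV, NumPy, PyTorch, and a recent torchvision (>= 0.13): a variance-of-Laplacian sharpness score and a frame-difference activity score stand in for the blur-detection and spatial-activity criteria for selecting representative frames, VGG-16 fc1 features serve as the spatial representation, the spatial map associated with the second temporal principal component supplies a motion representation, and a small LSTM encodes the temporal sequence before fusion with the spatial feature.

```python
# Hedged sketch of the two-stage SVSC pipeline; names and choices are illustrative.
import cv2
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"

# ---------- Stage 1: representative frames + VGG spatial features ----------
def frame_scores(frames):
    """Score frames by sharpness and spatial activity.
    Variance of the Laplacian stands in for a no-reference blur metric;
    mean absolute frame difference stands in for spatial activity."""
    scores, prev = [], None
    for f in frames:                                   # frames: list of HxWx3 uint8
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32)
        sharp = cv2.Laplacian(gray, cv2.CV_64F).var()
        act = 0.0 if prev is None else float(np.abs(gray - prev).mean())
        scores.append(sharp + act)
        prev = gray
    return np.asarray(scores)                          # pick frames with top scores

vgg = models.vgg16(weights="IMAGENET1K_V1").to(device).eval()
prep = T.Compose([T.ToTensor(), T.Resize((224, 224)),
                  T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def vgg_fc1(rgb_img):
    """4096-d feature from the first fully connected layer of VGG-16."""
    x = prep(rgb_img).unsqueeze(0).to(device)
    x = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
    return vgg.classifier[0](x)                        # first FC layer only

# ---------- Stage 2: PCA motion map + LSTM temporal encoding ----------
def second_temporal_pc_map(frames):
    """Spatial map associated with the second temporal principal component."""
    gray = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]).astype(np.float32)
    t, h, w = gray.shape
    X = gray.reshape(t, h * w)
    X -= X.mean(axis=0, keepdims=True)                 # remove the static background
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[1].reshape(h, w)                         # component 2 highlights motion

class TemporalEncoder(nn.Module):
    """Small LSTM over a sequence of per-window motion features, fused with
    a video-level spatial feature for classification into 8 scene categories."""
    def __init__(self, hidden=128, n_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden + 4096, n_classes)

    def forward(self, motion_seq, spatial_feat):       # (B, T, 4096), (B, 4096)
        _, (h_n, _) = self.lstm(motion_seq)
        fused = torch.cat([h_n[-1], spatial_feat], dim=1)
        return self.fc(fused)
```

In a fuller version, the motion maps would be computed over successive temporal windows, replicated to three channels, and passed through vgg_fc1 to form the (B, T, 4096) sequence consumed by TemporalEncoder, while spatial_feat could be the average fc1 feature of the selected representative frames; fusion by concatenation is likewise only one plausible reading of the fusion step.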

Acknowledgements

This work was supported by the Key International Cooperation Project of the National Natural Science Foundation of China (Grant No. 61720106002), the Key Research and Development Project of the Ministry of Science and Technology (Grant No. 2017YFC1405100), the National Natural Science Foundation of China (Grant No. 61901141), and the Fundamental Research Funds for the Central Universities (Grant No. HIT.HSRIF.2020010). The authors would like to thank the IEEE GRSS Image Analysis and Data Fusion Technical Committee for providing the Urthecast satellite videos.

Author information

Corresponding author

Correspondence to Guoming Gao.

About this article

Cite this article

Gu, Y., Liu, H., Wang, T. et al. Deep feature extraction and motion representation for satellite video scene classification. Sci. China Inf. Sci. 63, 140307 (2020). https://doi.org/10.1007/s11432-019-2784-4
