Rethinking Fusion Baselines for Multi-modal Human Action Recognition

Jiang, Hongda; Li, Yanghao; Song, Sijie; Liu, Jiaying

doi:10.1007/978-3-030-00764-5_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11166)

Included in the following conference series:

Pacific Rim Conference on Multimedia

Abstract

In this paper we study fusion baselines for multi-modal action recognition and explore different strategies for fusing multiple streams. First, we consider early fusion, which fuses the inputs of different modalities by stacking them directly along the channel dimension. Second, we analyze late fusion, which fuses the scores produced by the different modal streams. Third, we explore middle fusion at different aggregation stages. In addition, we develop a modal transformation module to adaptively exploit the complementary information in the various modalities. We give a comprehensive analysis of these fusion schemes through experimental results, and we hope our work will benefit the multi-modal action recognition community.
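To make the first two schemes concrete, the following is a minimal sketch, not the authors' implementation. It assumes a channel-first (C x H x W) layout, an RGB stream and a depth stream, and late fusion by a weighted average of per-stream softmax scores; the function names, shapes, logits, and the averaging rule are all illustrative assumptions that the abstract does not specify.

```python
import numpy as np

def early_fusion(rgb, depth):
    """Stack two modalities along the channel axis (assumed layout: C x H x W)."""
    if rgb.shape[1:] != depth.shape[1:]:
        raise ValueError("spatial sizes must match for channel stacking")
    return np.concatenate([rgb, depth], axis=0)

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of class logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def late_fusion(per_stream_logits, weights=None):
    """Fuse per-stream class scores by a weighted average (an assumed rule)."""
    scores = np.stack([softmax(l) for l in per_stream_logits])
    if weights is None:
        # Default to a uniform average over the streams.
        weights = np.full(len(scores), 1.0 / len(scores))
    weights = np.asarray(weights, dtype=float)
    return weights @ scores

# Early fusion: a 3-channel RGB frame and a 1-channel depth map become 4 channels.
rgb = np.zeros((3, 4, 4))
depth = np.ones((1, 4, 4))
fused_input = early_fusion(rgb, depth)

# Late fusion: each stream votes with its own (hypothetical) class logits.
rgb_logits = np.array([2.0, 0.5, 0.1])
depth_logits = np.array([0.2, 1.5, 0.3])
fused_scores = late_fusion([rgb_logits, depth_logits])
predicted_class = int(np.argmax(fused_scores))
```

Early fusion yields a single network input, so one model sees all modalities from the first layer, whereas late fusion keeps the streams independent until their scores are combined. Middle fusion, as described above, would merge intermediate features at some stage between these two extremes.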

This work was supported by the National Natural Science Foundation of China under contract No. 61772043 and the CCF-Tencent Open Research Fund.

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, China
Hongda Jiang, Yanghao Li, Sijie Song & Jiaying Liu

Corresponding author

Correspondence to Jiaying Liu.

Editor information

Editors and Affiliations

Hefei University of Technology, Hefei, China
Richang Hong
National Chiao Tung University, Hsinchu, Taiwan
Wen-Huang Cheng
University of Tokyo, Tokyo, Japan
Toshihiko Yamasaki
Hefei University of Technology, Hefei, China
Meng Wang
City University of Hong Kong, Hong Kong, Hong Kong
Chong-Wah Ngo

About this paper

Cite this paper

Jiang, H., Li, Y., Song, S., Liu, J. (2018). Rethinking Fusion Baselines for Multi-modal Human Action Recognition. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science, vol 11166. Springer, Cham. https://doi.org/10.1007/978-3-030-00764-5_17

DOI: https://doi.org/10.1007/978-3-030-00764-5_17
Published: 18 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00763-8
Online ISBN: 978-3-030-00764-5
eBook Packages: Computer Science, Computer Science (R0)
