
Transformer Networks for Future Person Localization in First-Person Videos

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13599)

Abstract

Reliable and accurate forecasting of pedestrian trajectories is essential for systems such as autonomous vehicles and visual assistive devices. Previous state-of-the-art methods model social interactions with LSTMs on videos captured by a static camera from a bird's-eye view. In contrast, we present a method that leverages the Transformer architecture to forecast future trajectories in first-person videos captured by a body-mounted camera, without modeling any social interactions. Forecasting future trajectories is challenging, chiefly because human motion is unpredictable. We address this by exploiting each target person's previous locations, scales, and dynamic poses, together with the camera wearer's ego-motion. The proposed model predicts each target's trajectory separately, in a simple way, without modeling interactions between humans or between targets and the scene. Experimental results show that our method outperforms previous state-of-the-art methods overall and yields better results in challenging situations where those methods fail.
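
To make the setup concrete, the following is a minimal sketch (in PyTorch) of the kind of per-target Transformer the abstract describes: it encodes an observed window of a target person's locations, scales, and 2D poses concatenated with the camera wearer's ego-motion, and regresses future image-plane locations. This is not the authors' implementation; the class name, all dimensions (e.g., 18 pose joints, a 6-D ego-motion vector, 10 observed and 10 predicted frames), and the learned positional encoding are illustrative assumptions only.

```python
# Illustrative sketch, NOT the authors' code. A per-target Transformer that
# fuses past locations, scales, poses, and camera-wearer ego-motion to
# regress future 2D locations; no social-interaction terms are modeled.
import torch
import torch.nn as nn


class FutureLocalizationTransformer(nn.Module):
    def __init__(self, pose_joints=18, ego_dim=6, d_model=128, nhead=8,
                 num_layers=4, obs_len=10, pred_len=10):
        super().__init__()
        # Per-frame input: (x, y) location + scale + flattened 2D pose
        # + ego-motion vector of the body-mounted camera (dims are assumed).
        in_dim = 2 + 1 + 2 * pose_joints + ego_dim
        self.embed = nn.Linear(in_dim, d_model)
        # Learned positional encoding over the observation window.
        self.pos_enc = nn.Parameter(torch.zeros(obs_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        # Regress all future (x, y) offsets from the encoded window at once.
        self.head = nn.Linear(obs_len * d_model, pred_len * 2)
        self.pred_len = pred_len

    def forward(self, feats):
        # feats: (batch, obs_len, in_dim) -- one target person per sample,
        # so each trajectory is modeled separately, as in the abstract.
        h = self.embed(feats) + self.pos_enc   # (batch, obs_len, d_model)
        h = self.encoder(h)                    # (batch, obs_len, d_model)
        out = self.head(h.flatten(1))          # (batch, pred_len * 2)
        return out.view(-1, self.pred_len, 2)  # future (x, y) per frame


# Usage: predict 10 future frames from 10 observed frames for 4 targets.
model = FutureLocalizationTransformer()
past = torch.randn(4, 10, 2 + 1 + 2 * 18 + 6)
future = model(past)  # shape: (4, 10, 2)
```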



Author information


Corresponding author

Correspondence to Hideo Saito.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Alikadic, A., Saito, H., Hachiuma, R. (2022). Transformer Networks for Future Person Localization in First-Person Videos. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2022. Lecture Notes in Computer Science, vol 13599. Springer, Cham. https://doi.org/10.1007/978-3-031-20716-7_14


  • DOI: https://doi.org/10.1007/978-3-031-20716-7_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20715-0

  • Online ISBN: 978-3-031-20716-7

  • eBook Packages: Computer Science (R0)
