
Transformer Networks for Future Person Localization in First-Person Videos

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13599)

Abstract

Reliable and accurate forecasting of pedestrian trajectories is essential for systems such as autonomous vehicles and visual assistive devices. Previous state-of-the-art methods model social interactions with LSTMs on videos captured by a static camera from a bird's-eye view. In contrast, we present a method that leverages the Transformer architecture to forecast future trajectories in first-person videos captured by a body-mounted camera, without modeling any social interactions. Forecasting future trajectories is challenging, chiefly because human motion is unpredictable. We address this by exploiting each target person's previous locations, scales, and dynamic poses, together with the camera wearer's ego-motion. The proposed model predicts each target's trajectory separately, in a simple way, without modeling interactions between humans or between targets and the scene. Experimental results show that our method outperforms previous state-of-the-art methods overall and yields better results in challenging situations where those methods fail.
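
To make the setup concrete, the following is a minimal sketch (in PyTorch) of the kind of per-target Transformer the abstract describes: it encodes an observed window of a target person's locations, scales, and 2D poses concatenated with the camera wearer's ego-motion, and regresses future image-plane locations. This is not the authors' implementation; the class name, all dimensions (e.g., 18 pose joints, a 6-D ego-motion vector, 10 observed and 10 predicted frames), and the learned positional encoding are illustrative assumptions only.

```python
# Illustrative sketch, NOT the authors' code. A per-target Transformer that
# fuses past locations, scales, poses, and camera-wearer ego-motion to
# regress future 2D locations; no social-interaction terms are modeled.
import torch
import torch.nn as nn


class FutureLocalizationTransformer(nn.Module):
    def __init__(self, pose_joints=18, ego_dim=6, d_model=128, nhead=8,
                 num_layers=4, obs_len=10, pred_len=10):
        super().__init__()
        # Per-frame input: (x, y) location + scale + flattened 2D pose
        # + ego-motion vector of the body-mounted camera (dims are assumed).
        in_dim = 2 + 1 + 2 * pose_joints + ego_dim
        self.embed = nn.Linear(in_dim, d_model)
        # Learned positional encoding over the observation window.
        self.pos_enc = nn.Parameter(torch.zeros(obs_len, d_model))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        # Regress all future (x, y) offsets from the encoded window at once.
        self.head = nn.Linear(obs_len * d_model, pred_len * 2)
        self.pred_len = pred_len

    def forward(self, feats):
        # feats: (batch, obs_len, in_dim) -- one target person per sample,
        # so each trajectory is modeled separately, as in the abstract.
        h = self.embed(feats) + self.pos_enc   # (batch, obs_len, d_model)
        h = self.encoder(h)                    # (batch, obs_len, d_model)
        out = self.head(h.flatten(1))          # (batch, pred_len * 2)
        return out.view(-1, self.pred_len, 2)  # future (x, y) per frame


# Usage: predict 10 future frames from 10 observed frames for 4 targets.
model = FutureLocalizationTransformer()
past = torch.randn(4, 10, 2 + 1 + 2 * 18 + 6)
future = model(past)  # shape: (4, 10, 2)
```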



Author information


Corresponding author

Correspondence to Hideo Saito.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Alikadic, A., Saito, H., Hachiuma, R. (2022). Transformer Networks for Future Person Localization in First-Person Videos. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2022. Lecture Notes in Computer Science, vol 13599. Springer, Cham. https://doi.org/10.1007/978-3-031-20716-7_14


  • DOI: https://doi.org/10.1007/978-3-031-20716-7_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20715-0

  • Online ISBN: 978-3-031-20716-7

  • eBook Packages: Computer Science (R0)
