WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-Grained Spatial-Temporal Understanding

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15134)

Abstract

In this paper, we address the challenge of fine-grained video event understanding in traffic scenarios, vital for autonomous driving and safety. Traditional datasets focus on driver or vehicle behavior, often neglecting pedestrian perspectives. To fill this gap, we introduce the WTS dataset, highlighting detailed behaviors of both vehicles and pedestrians across over 1.2k video events in hundreds of traffic scenarios. WTS integrates diverse perspectives from vehicle ego and fixed overhead cameras in a vehicle-infrastructure cooperative environment, enriched with comprehensive textual descriptions and unique 3D Gaze data for a synchronized 2D/3D view, focusing on pedestrian analysis. We also provide annotations for 5k publicly sourced pedestrian-related traffic videos. Additionally, we introduce LLMScorer, an LLM-based evaluation metric to align inference captions with ground truth. Using WTS, we establish a benchmark for dense video-to-text tasks, exploring state-of-the-art Vision-Language Models with an instance-aware VideoLLM method as a baseline. WTS aims to advance fine-grained video event understanding, enhancing traffic safety and autonomous driving development. Dataset page: https://woven-visionai.github.io/wts-dataset-homepage/.

Work done while Ta Gu was an intern at Woven by Toyota.
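
The LLMScorer metric mentioned in the abstract, an LLM-based score that aligns inference captions with ground truth, can be illustrated with a short sketch. The Python snippet below is only an assumption-based illustration of the general LLM-as-judge idea, not the paper's actual LLMScorer implementation: the prompt wording, the gpt-4o-mini model name, the 0-100 rating scale, and the llm_caption_score helper are all hypothetical.

    # Illustrative sketch only (not the paper's LLMScorer): score how well a
    # predicted traffic-scene caption aligns with a ground-truth caption by
    # asking an LLM to rate their agreement. Assumes the openai Python package
    # and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    def llm_caption_score(predicted: str, ground_truth: str,
                          model: str = "gpt-4o-mini") -> float:
        """Return an alignment score in [0, 1] for a predicted caption."""
        prompt = (
            "Rate how well the predicted caption of a traffic scene matches the "
            "reference caption in terms of the described pedestrian and vehicle "
            "behavior. Reply with a single integer from 0 (unrelated) to 100 "
            "(equivalent).\n\n"
            f"Reference: {ground_truth}\nPredicted: {predicted}"
        )
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # A production metric would validate the reply; here we assume a bare integer.
        return float(response.choices[0].message.content.strip()) / 100.0

Normalizing the rating to [0, 1] makes it straightforward to average such a judge score over a test set alongside conventional caption metrics.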

Author information

Corresponding author

Correspondence to Quan Kong.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2502 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kong, Q. et al. (2025). WTS: A Pedestrian-Centric Traffic Video Dataset for Fine-Grained Spatial-Temporal Understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15134. Springer, Cham. https://doi.org/10.1007/978-3-031-73116-7_1

  • DOI: https://doi.org/10.1007/978-3-031-73116-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73115-0

  • Online ISBN: 978-3-031-73116-7
