
DenseTrack: Drone-Based Crowd Tracking via Density-Aware Motion-Appearance Synergy

Published: 28 October 2024

Abstract

Drone-based crowd tracking faces difficulties in accurately identifying and monitoring objects from an aerial perspective, largely due to their small size and close proximity to one another, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack capitalizes on crowd counting to precisely determine object locations, blending visual and motion cues to improve the tracking of small-scale objects. It explicitly models cross-frame motion to enhance tracking accuracy and reliability. DenseTrack employs crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack enhances the ability to distinguish small-scale objects using insights from a visual-language model, integrating appearance with motion cues. The framework utilizes the Hungarian algorithm to ensure accurate matching of individuals across frames. Demonstrated on the DroneCrowd dataset, our approach exhibits superior performance, confirming its effectiveness in drone-captured scenarios. Our code will be available at: https://github.com/Zebrabeast/DenseTrack.
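The abstract's final matching step (fusing appearance and motion cues, then associating individuals across frames with the Hungarian algorithm) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `match_detections`, the cost weighting, and the gating radius are assumptions for the example; only the appearance-plus-motion cost fusion and Hungarian assignment come from the abstract.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_detections(prev_feats, prev_pos, curr_feats, curr_pos,
                     motion_weight=0.5, max_dist=50.0):
    """Associate objects across frames by fusing appearance and motion cues.

    prev_feats / curr_feats: (N, D) / (M, D) L2-normalized appearance embeddings.
    prev_pos / curr_pos:     (N, 2) / (M, 2) point locations in pixels.
    Returns a list of (prev_idx, curr_idx) matched pairs.
    """
    # Appearance cost: cosine distance between normalized embeddings.
    app_cost = 1.0 - prev_feats @ curr_feats.T
    # Motion cost: Euclidean offset between positions, scaled by a gating
    # radius so both cues live on a comparable scale.
    diff = prev_pos[:, None, :] - curr_pos[None, :, :]
    motion_cost = np.linalg.norm(diff, axis=-1) / max_dist
    # Fused cost matrix, solved globally with the Hungarian algorithm.
    cost = motion_weight * motion_cost + (1.0 - motion_weight) * app_cost
    rows, cols = linear_sum_assignment(cost)
    # Gate out implausible pairs that moved farther than the radius.
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if motion_cost[r, c] <= 1.0]
```

In a real tracker, `prev_pos` would be the motion-offset-predicted positions rather than raw previous-frame positions, and unmatched detections would spawn new tracks.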

Supplemental Material

MP4 File - DenseTrack: Drone-Based Crowd Tracking via Density-Aware Motion-Appearance Synergy
A brief introduction to DenseTrack: Drone-Based Crowd Tracking via Density-Aware Motion-Appearance Synergy


Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN:9798400706868
DOI:10.1145/3664647

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. crowd localization
  2. motion-appearance fusion
  3. vision-language pre-training
  4. multi-object tracking

Qualifiers

  • Research-article

Funding Sources

  • National Research Foundation Singapore under the AI Singapore Programme
  • Sanya Yazhou Bay Science and Technology City Administration scientific research project
  • Guangdong Natural Science Funds for Distinguished Young Scholar
  • National Natural Science Foundation of China

Conference

MM '24
Sponsor:
MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

MM '24 Paper Acceptance Rate: 1,150 of 4,385 submissions (26%)
Overall Acceptance Rate: 2,145 of 8,556 submissions (25%)
