
Do We Really Need Frame-by-Frame Annotation Datasets for Object Tracking?

Published: 17 October 2021

Abstract

There has been an increasing emphasis on building large-scale datasets as the driver of deep learning-based trackers' success. However, accurately annotating tracking data frame by frame is highly labor-intensive and expensive, making it impractical in many real-world applications. In this study, we investigate whether large-scale training data is truly necessary to ensure tracking algorithms' performance. To this end, we introduce FAT (Few-Annotation Tracking), a benchmark constructed by sampling one or a few frames per video from existing tracking datasets. The proposed benchmark can be used to evaluate both the data efficiency of tracking algorithms and new data augmentation approaches for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy that enables learning high-performing trackers from small-scale datasets. AMMC first cuts out the tracked targets and applies a sequence of transformations to simulate the appearance changes caused by object motion. The transformed targets are then pasted onto inpainted background images, and the composited frames are jointly augmented to mimic the variability caused by camera motion. Compared with standard augmentation methods, AMMC explicitly accounts for the characteristics of tracking data and thus synthesizes more valid training samples for object tracking. We extensively evaluate our approach with two popular trackers on the FAT datasets. Experiments show that our method enables these trackers, even when trained on data requiring far less annotation, to achieve performance comparable to or better than that obtained with fully annotated datasets. These results imply that complete video annotation may not be necessary for object tracking when motion-driven data augmentations are leveraged during training.
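
To make the augmentation strategy concrete, the following is a minimal Python sketch (using OpenCV and NumPy) of what a single AMMC-style augmentation step could look like, based solely on the description above: cut out the target, transform it to mimic object-motion change, paste it onto the inpainted background, and jointly transform the composite to mimic camera motion. The function name, the choice of transformations, and all parameter ranges are illustrative assumptions, not the authors' implementation.

import cv2
import numpy as np

def ammc_augment(frame, box, rng=None):
    """Hypothetical AMMC-style step. frame: HxWx3 uint8 image; box: (x, y, w, h)."""
    if rng is None:
        rng = np.random.default_rng()
    x, y, w, h = box
    target = frame[y:y + h, x:x + w].copy()

    # 1) Cut out the target and inpaint the hole to obtain a clean background.
    hole = np.zeros(frame.shape[:2], dtype=np.uint8)
    hole[y:y + h, x:x + w] = 255
    background = cv2.inpaint(frame, hole, 3, cv2.INPAINT_TELEA)

    # 2) Mimic object-motion change: rescale and rotate the cut-out target
    #    (the 0.8-1.2 scale and +/-15 degree ranges are assumptions).
    scale = rng.uniform(0.8, 1.2)
    target = cv2.resize(target, None, fx=scale, fy=scale)
    th, tw = target.shape[:2]
    rot = cv2.getRotationMatrix2D((tw / 2, th / 2), rng.uniform(-15, 15), 1.0)
    target = cv2.warpAffine(target, rot, (tw, th))

    # 3) Paste the transformed target back at a shifted position, which
    #    mimics the object moving between frames.
    H, W = background.shape[:2]
    nx = int(np.clip(x + rng.integers(-20, 21), 0, max(W - tw, 0)))
    ny = int(np.clip(y + rng.integers(-20, 21), 0, max(H - th, 0)))
    background[ny:ny + th, nx:nx + tw] = target[:H - ny, :W - nx]

    # 4) Mimic camera motion: jointly translate the whole composite and shift
    #    the box annotation by the same offset.
    dx, dy = (int(v) for v in rng.integers(-10, 11, size=2))
    shift = np.float32([[1, 0, dx], [0, 1, dy]])
    augmented = cv2.warpAffine(background, shift, (W, H))
    return augmented, (nx + dx, ny + dy, tw, th)

In a training setup like the one the abstract describes, such synthesized frame-box pairs would stand in for genuinely different frames of the same video, which is what would allow a tracker to learn temporal variation from only one or a few annotated frames per sequence.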

Supplementary Material

MP4 File (MM21-fp1178.mp4)
Accurately annotating tracking data is highly labor-intensive and expensive. In this study, we investigate whether large-scale training data is truly necessary to ensure tracking algorithms' performance. To this end, we introduce FAT (Few-Annotation Tracking), a benchmark constructed by sampling one or a few frames per video from existing tracking datasets. The proposed benchmark can be used to evaluate both the data efficiency of tracking algorithms and new data augmentation approaches for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy that enables learning high-performing trackers from small-scale datasets. We extensively evaluate our approach with two popular trackers on the FAT datasets. The results imply that complete video annotation may not be necessary for object tracking when motion-driven data augmentations are leveraged during training.

Index Terms

  1. Do We Really Need Frame-by-Frame Annotation Datasets for Object Tracking?

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data augmentation
    2. small-scale dataset
    3. visual object tracking

    Qualifiers

    • Research-article

    Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

• (2024) Target-Aware Transformer for Satellite Video Object Tracking. IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, 1-10. DOI: 10.1109/TGRS.2023.3339658
• (2024) A survey of video-based human action recognition in team sports. Artificial Intelligence Review, Vol. 57, 11. DOI: 10.1007/s10462-024-10934-9
• (2023) A Hybrid Approach Based on GAN and CNN-LSTM for Aerial Activity Recognition. Remote Sensing, Vol. 15, 14, 3626. DOI: 10.3390/rs15143626
• (2023) NCSiam: Reliable Matching via Neighborhood Consensus for Siamese-Based Object Tracking. IEEE Transactions on Image Processing, Vol. 32, 6168-6182. DOI: 10.1109/TIP.2023.3329669
• (2023) IoUNet++. IET Computer Vision, Vol. 18, 1, 177-189. DOI: 10.1049/cvi2.12235
• (2022) Survey on Videos Data Augmentation for Deep Learning Models. Future Internet, Vol. 14, 3, 93. DOI: 10.3390/fi14030093