
Do We Really Need Frame-by-Frame Annotation Datasets for Object Tracking?

Published: 17 October 2021

Abstract

There has been an increasing emphasis on building large-scale datasets as the driver of deep learning-based trackers' success. However, accurately annotating tracking data frame by frame is highly labor-intensive and expensive, making it impractical in many real-world applications. In this study, we investigate whether large-scale training data is truly necessary to ensure tracking algorithms' performance. To this end, we introduce FAT (Few-Annotation Tracking), a benchmark constructed by sampling one or a few frames per video from existing tracking datasets. The proposed benchmark can be used to evaluate both the data efficiency of tracking algorithms and new data augmentation approaches for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy that enables learning high-performing trackers from small-scale datasets. AMMC first cuts out the tracked targets and applies a sequence of transformations to simulate the appearance changes caused by object motion. The transformed targets are then pasted onto inpainted background images, and the composited frames are jointly augmented to mimic the variability caused by camera motion. Compared with standard augmentation methods, AMMC explicitly accounts for the characteristics of tracking data and thus synthesizes more valid training samples for object tracking. We extensively evaluate our approach with two popular trackers on the FAT datasets. Experiments show that our method enables these trackers, even when trained on data requiring far less annotation, to achieve performance comparable to or better than that obtained with fully annotated datasets. These results imply that complete video annotation may not be necessary for object tracking when motion-driven data augmentations are leveraged during training.
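
To make the augmentation strategy concrete, the following is a minimal Python sketch (using OpenCV and NumPy) of what a single AMMC-style augmentation step could look like, based solely on the description above: cut out the target, transform it to mimic object-motion change, paste it onto the inpainted background, and jointly transform the composite to mimic camera motion. The function name, the choice of transformations, and all parameter ranges are illustrative assumptions, not the authors' implementation.

import cv2
import numpy as np

def ammc_augment(frame, box, rng=None):
    """Hypothetical AMMC-style step. frame: HxWx3 uint8 image; box: (x, y, w, h)."""
    if rng is None:
        rng = np.random.default_rng()
    x, y, w, h = box
    target = frame[y:y + h, x:x + w].copy()

    # 1) Cut out the target and inpaint the hole to obtain a clean background.
    hole = np.zeros(frame.shape[:2], dtype=np.uint8)
    hole[y:y + h, x:x + w] = 255
    background = cv2.inpaint(frame, hole, 3, cv2.INPAINT_TELEA)

    # 2) Mimic object-motion change: rescale and rotate the cut-out target
    #    (the 0.8-1.2 scale and +/-15 degree ranges are assumptions).
    scale = rng.uniform(0.8, 1.2)
    target = cv2.resize(target, None, fx=scale, fy=scale)
    th, tw = target.shape[:2]
    rot = cv2.getRotationMatrix2D((tw / 2, th / 2), rng.uniform(-15, 15), 1.0)
    target = cv2.warpAffine(target, rot, (tw, th))

    # 3) Paste the transformed target back at a shifted position, which
    #    mimics the object moving between frames.
    H, W = background.shape[:2]
    nx = int(np.clip(x + rng.integers(-20, 21), 0, max(W - tw, 0)))
    ny = int(np.clip(y + rng.integers(-20, 21), 0, max(H - th, 0)))
    background[ny:ny + th, nx:nx + tw] = target[:H - ny, :W - nx]

    # 4) Mimic camera motion: jointly translate the whole composite and shift
    #    the box annotation by the same offset.
    dx, dy = (int(v) for v in rng.integers(-10, 11, size=2))
    shift = np.float32([[1, 0, dx], [0, 1, dy]])
    augmented = cv2.warpAffine(background, shift, (W, H))
    return augmented, (nx + dx, ny + dy, tw, th)

In a training setup like the one the abstract describes, such synthesized frame-box pairs would stand in for genuinely different frames of the same video, which is what would allow a tracker to learn temporal variation from only one or a few annotated frames per sequence.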

Supplementary Material

MP4 File (MM21-fp1178.mp4)
Accurately annotating tracking data is highly labor-intensive and expensive. In this study, we investigate whether large-scale training data is truly necessary to ensure tracking algorithms' performance. To this end, we introduce FAT (Few-Annotation Tracking), a benchmark constructed by sampling one or a few frames per video from existing tracking datasets. The proposed benchmark can be used to evaluate both the data efficiency of tracking algorithms and new data augmentation approaches for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy that enables learning high-performing trackers from small-scale datasets. We extensively evaluate our approach with two popular trackers on the FAT datasets. The results imply that complete video annotation may not be necessary for object tracking when motion-driven data augmentations are leveraged during training.

Index Terms

  1. Do We Really Need Frame-by-Frame Annotation Datasets for Object Tracking?

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    MM '21: Proceedings of the 29th ACM International Conference on Multimedia
    October 2021
    5796 pages
    ISBN:9781450386517
    DOI:10.1145/3474085

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data augmentation
    2. small-scale dataset
    3. visual object tracking

    Qualifiers

    • Research-article

    Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

• (2024) Target-Aware Transformer for Satellite Video Object Tracking. IEEE Transactions on Geoscience and Remote Sensing, Vol. 62, 1-10. DOI: 10.1109/TGRS.2023.3339658
• (2024) A survey of video-based human action recognition in team sports. Artificial Intelligence Review, Vol. 57, 11. DOI: 10.1007/s10462-024-10934-9
• (2023) A Hybrid Approach Based on GAN and CNN-LSTM for Aerial Activity Recognition. Remote Sensing, Vol. 15, 14, 3626. DOI: 10.3390/rs15143626
• (2023) NCSiam: Reliable Matching via Neighborhood Consensus for Siamese-Based Object Tracking. IEEE Transactions on Image Processing, Vol. 32, 6168-6182. DOI: 10.1109/TIP.2023.3329669
• (2023) IoUNet++. IET Computer Vision, Vol. 18, 1, 177-189. DOI: 10.1049/cvi2.12235
• (2022) Survey on Videos Data Augmentation for Deep Learning Models. Future Internet, Vol. 14, 3, 93. DOI: 10.3390/fi14030093