Abstract:
The common paradigm of CNN-based action recognition models is to simply average the dense predictions from every frame. However, these dense predictions are inefficient because all frames are weighted equally regardless of whether the action is present. In real-time action recognition applications where the input video is untrimmed, dense prediction is even more severely inefficient. Instead of dense prediction, we propose a lightweight CNN-based sampler that selects the frames relevant to the action in the video. The proposed sampler is trained with a self-supervised learning method that mixes one irrelevant frame into an input video clip and learns to find that irrelevant frame. With the trained sampler, we select the frames used as inputs to the action recognition model. Our sampler can be combined with any action recognition model and works independently of it. Through various experiments, we demonstrate that the proposed sampler is not only time- and memory-efficient but also significantly improves performance on action recognition benchmarks, including HMDB51 and UCF101. Finally, we applied the self-supervised sampler to a real-time surveillance system in a subway station and improved action recognition performance by 7% with an 80% reduction in computation. Code and dataset are available at https://github.com/seominseok0429/SelSupSampler and https://aihub.or.kr/aidata/34124.
Published in: IEEE Robotics and Automation Letters ( Volume: 7, Issue: 2, April 2022)
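The self-supervised pretext task the abstract describes (mix one irrelevant frame into a clip, then predict its position) can be sketched as below. This is a minimal illustration of the data construction only, not the authors' implementation; the function name, array shapes, and sampling details are assumptions.

```python
import numpy as np

def make_mixed_clip(clip, distractor_frames, rng):
    """Build one self-supervised training example.

    clip: (T, H, W, C) consecutive frames from one video.
    distractor_frames: (N, H, W, C) frames from a different video.
    Returns the clip with one frame replaced by a distractor frame,
    plus the index of that frame, which serves as the training label
    for the sampler (trained, e.g., with cross-entropy over T classes).
    """
    T = clip.shape[0]
    idx = int(rng.integers(T))                              # position to corrupt
    d = distractor_frames[rng.integers(distractor_frames.shape[0])]
    mixed = clip.copy()
    mixed[idx] = d                                          # inject the irrelevant frame
    return mixed, idx

# Toy example: an 8-frame "video" of zeros, distractors of ones.
rng = np.random.default_rng(0)
clip = np.zeros((8, 4, 4, 3))
other = np.ones((5, 4, 4, 3))
mixed, label = make_mixed_clip(clip, other, rng)
```

At inference time no mixing is done; the trained sampler instead scores the frames of a real clip and the top-scoring frames are passed to the downstream action recognition model.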