Abstract:
We focus on the problem of fine-grained visual classification (FGVC). We posit that the unreasonable effectiveness of state-of-the-art methods in this area stems from the presence of similar object categories in the ImageNet dataset, which allows such models to be pretrained on a much larger set of samples and to learn generic features for those object categories. We observe an important and often ignored additional structure in FGVC problems: the objects are captured from only a small set of viewing angles. We notice that subtle differences between object categories are difficult to pick out from an arbitrary angle but easier to identify when the objects are seen in a similar pose. We show in this paper that training specialized pose experts, each focusing on classification from a single, fixed pose, and combining them in an ensemble-style framework successfully exploits this structure. We demonstrate the effectiveness of the proposed approach on the benchmark Stanford Cars, FGVC-Aircrafts, and DeepFashion datasets. To highlight the contribution when features for the target category may not be available in a pretrained network, we test on the footwear class. We contribute a new 1000-object, 12-category footwear dataset, with each object captured from 4 different poses, and show significant improvement on this dataset.
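The abstract does not say how the pose experts are combined, so the following is only a minimal sketch of one plausible ensemble rule: each expert's class distribution is weighted by the estimated probability that the image shows that expert's pose. The function names `combine_pose_experts` and `predict_category`, the soft-weighting rule, and the assumption that a separate pose estimator supplies `pose_probs` are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def combine_pose_experts(
    pose_probs: Sequence[float],
    expert_class_probs: Sequence[Sequence[float]],
) -> List[float]:
    """Mix per-pose expert predictions into one class distribution.

    pose_probs[i] is the (assumed) probability that the image shows pose i.
    expert_class_probs[i] is the class distribution from the expert for pose i.
    The soft weighting used here is an illustrative assumption.
    """
    if len(pose_probs) != len(expert_class_probs):
        raise ValueError("need exactly one expert output per pose")
    if not expert_class_probs:
        raise ValueError("need at least one expert")

    num_classes = len(expert_class_probs[0])
    combined = [0.0] * num_classes
    for weight, class_probs in zip(pose_probs, expert_class_probs):
        if len(class_probs) != num_classes:
            raise ValueError("all experts must score the same classes")
        for c, p in enumerate(class_probs):
            combined[c] += weight * p

    # Renormalise so the result sums to 1 even if the inputs do not exactly.
    total = sum(combined)
    if total <= 0.0:
        raise ValueError("combined scores must have positive mass")
    return [v / total for v in combined]


def predict_category(
    pose_probs: Sequence[float],
    expert_class_probs: Sequence[Sequence[float]],
) -> int:
    """Return the index of the highest-scoring class after combination."""
    combined = combine_pose_experts(pose_probs, expert_class_probs)
    return max(range(len(combined)), key=combined.__getitem__)


# Two hypothetical poses (e.g. front and side) and two categories.
pose_probs = [0.9, 0.1]
expert_class_probs = [
    [0.2, 0.8],  # expert for pose 0
    [0.6, 0.4],  # expert for pose 1
]
label = predict_category(pose_probs, expert_class_probs)
```

In this toy example the image is most likely in pose 0, so the pose-0 expert dominates and its preferred category wins. A hard-gated variant, in which only the expert for the most probable pose votes, would be an equally reasonable reading of "ensemble-style".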
Date of Conference: 07-10 October 2018
Date Added to IEEE Xplore: 06 September 2018
ISBN Information:
Electronic ISSN: 2381-8549