In this paper, we systematically analyze unsupervised domain adaptation pipelines for object classification in a challenging industrial setting. In contrast to work on the standard natural-object benchmarks in the field, our results highlight the most important design choices when only category-labeled CAD models are available but classification must be performed on real-world images. Our domain adaptation pipeline achieves state-of-the-art (SoTA) performance on the VisDA benchmark and, more importantly, drastically improves recognition performance on our new open industrial dataset comprising 102 mechanical parts. We conclude with a set of guidelines for practitioners who need to apply state-of-the-art unsupervised domain adaptation. Our code is available at https://github.com/dritter-bht/synthnet-transfer-learning.
This work was funded by the German Federal Ministry of Education and Research (BMBF) through their support of the project SynthNet, a part of the KMU-Innovativ initiative (project code: 01IS21002C), the KI-Werkstatt project at the University of Applied Sciences Berlin, part of the Forschung an Fachhochschulen program (project code: 13FH028KI1), as well as the project TAHAI (funded by IFAF Berlin).
A Implementation Details
A.1 Adapting Pretrained Models to Rendered Images Implementation Details
We use the pretrained models “google/vit-base-patch16-224-in21k” (ViT) [2], “microsoft/swinv2-base-patch4-window12-192-22k” (SwinV2) [17], “facebook/convnextv2-base-22k-224” (ConvNextV2) [27], and “facebook/deit-base-distilled-patch16-224” (DeiT) [25] from Huggingface for experiments on the VisDA-2017 dataset, but only ViT and SwinV2 for our Topex-Printer dataset. ViT, SwinV2, and ConvNextV2 were pretrained on ImageNet22K, while DeiT was pretrained on ImageNet1K. We evaluate three training schemes: training the classification head only (CH), fine-tuning the full model (FT), and a combination of both that tunes the classification head first and then continues with full fine-tuning (CH-FT), inspired by [14].
For CH, we use the PyTorch SGD optimizer with learning rates [10.0, 0.1, 0.001], momentum 0.9, no weight decay, no learning rate scheduler, and no warmup.
For FT, we use the PyTorch implementation of the AdamW optimizer with learning rates [0.1, 0.001, 0.00001], weight decay 0.01, a cosine annealing learning rate scheduler [21] without restarts, and two warmup epochs (10% of total epochs).
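The FT learning rate schedule can be sketched in plain Python as follows. This is a minimal illustration, not the code from the repository: the function name `lr_at_epoch` is hypothetical, the total of 20 epochs is inferred from "two warmup epochs (10% of total epochs)", and the linear warmup shape, the per-epoch update granularity, and the minimum learning rate of 0 are assumptions that the text does not state.

```python
import math


def lr_at_epoch(epoch, base_lr, total_epochs=20, warmup_epochs=2, min_lr=0.0):
    """Learning rate for a 0-indexed epoch.

    Assumptions (not stated in the text): linear warmup, one update per
    epoch, and annealing down towards min_lr = 0.
    """
    if epoch < warmup_epochs:
        # Linear warmup: reach base_lr at the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing without restarts over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One schedule per FT learning rate listed above.
schedules = {
    base_lr: [lr_at_epoch(epoch, base_lr) for epoch in range(20)]
    for base_lr in (0.1, 0.001, 0.00001)
}
```

Under these assumptions, each schedule rises to its base learning rate during the first two epochs and then decays monotonically along a half cosine wave.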
For both datasets, we use the PyTorch 2.0.0 implementation for data augmentation.
A.2 Adapting to Real-World Images with Unsupervised Domain Adaptation Implementation Details
For UDA experiments, we start from the best source-domain-only trained CH checkpoint for the respective model architecture and continue training using the same parameters as the best FT run for each model, as described in the paper. We use the PyTorch 2.0.0 implementations of the image augmentations random resized crop, horizontal flip, and AugMix [6] with the same parameters described in the last paragraph of Sect. A.1. We use the Transfer Learning Library (tllib) [9, 10] implementations of the CDAN (hidden size 1024) and MCC [11] (temperature 1.0) domain adaptation methods and also combine both, using two different initial checkpoints for each model architecture: the Huggingface checkpoint pretrained on ImageNet22K [1] (“google/vit-base-patch16-224-in21k” (ViT) and “microsoft/swinv2-base-patch4-window12-192-22k” (SwinV2)) and the best-performing checkpoint after training only the classification head in our source-domain-only experiments. Again, we use the global random seed 42 for all experiments, and training is performed on a single NVIDIA Tesla V100 PCIe 32 GB GPU.
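To make the role of the MCC temperature concrete, the following is a simplified, dependency-free sketch of the minimum class confusion objective as formulated in [11]. It is not the tllib implementation used in our experiments, and it may differ from that implementation in details such as how the confusion matrix is normalized. The helper names `softmax`, `entropy`, and `mcc_loss` are chosen here for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one sample's logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy of one probability vector (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mcc_loss(batch_logits, temperature=1.0):
    """Simplified MCC loss on a batch of unlabeled target-domain logits."""
    preds = [softmax(row, temperature) for row in batch_logits]
    batch_size, num_classes = len(preds), len(preds[0])

    # Uncertainty reweighting: confident (low-entropy) samples count more.
    raw = [1.0 + math.exp(-entropy(p)) for p in preds]
    weights = [batch_size * w / sum(raw) for w in raw]

    # Class correlation matrix C[j][k] = sum_i w_i * p_ij * p_ik.
    confusion = [
        [sum(weights[i] * preds[i][j] * preds[i][k] for i in range(batch_size))
         for k in range(num_classes)]
        for j in range(num_classes)
    ]

    # Row-normalize and average the off-diagonal (between-class) mass.
    loss = 0.0
    for j in range(num_classes):
        row_total = sum(confusion[j])
        loss += sum(confusion[j][k] / row_total
                    for k in range(num_classes) if k != j)
    return loss / num_classes
```

In this sketch, a batch of completely uniform predictions over C classes yields a loss of (C - 1)/C, while a batch in which every class is predicted confidently for some sample yields a loss close to zero. A higher temperature flattens the predictions and therefore raises the measured confusion.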
Unlike other methods, our approach is considerably better at correctly identifying the truck class but underperforms on the motorcycle and person classes. The confusion matrix in Fig. 6 shows that our trained model often confuses motorcycle samples with bicycles (7%) and skateboards (10%), while the person class is confused rather uniformly (3%–4%) with skateboards, plants, motorcycles, and horses.
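Percentages of this kind correspond to a row-normalized confusion matrix, in which each row gives the distribution of predictions for one true class. The sketch below shows the computation on a small made-up example; the labels and counts are hypothetical and are not taken from our experiments, and the function name `row_normalized_confusion` is chosen for illustration.

```python
from collections import Counter


def row_normalized_confusion(true_labels, pred_labels, classes):
    """Return matrix[true][pred] = share of `true` samples predicted as `pred`."""
    counts = Counter(zip(true_labels, pred_labels))
    matrix = {}
    for true_class in classes:
        row_total = sum(counts[(true_class, pred)] for pred in classes)
        matrix[true_class] = {
            pred: (counts[(true_class, pred)] / row_total if row_total else 0.0)
            for pred in classes
        }
    return matrix


# Hypothetical toy data: ten motorcycle samples and five bicycle samples.
classes = ["motorcycle", "bicycle", "skateboard"]
true_labels = ["motorcycle"] * 10 + ["bicycle"] * 5
pred_labels = (["motorcycle"] * 8 + ["bicycle"] + ["skateboard"]
               + ["bicycle"] * 5)

matrix = row_normalized_confusion(true_labels, pred_labels, classes)
```

Each row of `matrix` sums to one for every class that occurs in the ground truth, so an off-diagonal entry can be read directly as the fraction of that class mistaken for another.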
