Abstract
In this paper, we systematically analyze unsupervised domain adaptation pipelines for object classification in a challenging industrial setting. In contrast to standard natural-object benchmarks in the field, our setting provides only category-labeled CAD models at training time, while classification must be performed on real-world images; our results highlight the most important design choices for this scenario. Our domain adaptation pipeline achieves state-of-the-art performance on the VisDA benchmark and, more importantly, drastically improves recognition performance on our new open industrial dataset comprising 102 mechanical parts. We conclude with a set of guidelines for practitioners who need to apply state-of-the-art unsupervised domain adaptation in practice. Our code is available at https://github.com/dritter-bht/synthnet-transfer-learning.
References
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)
Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2096–2030 (2016)
Goehring, D., Hoffman, J., Rodner, E., Saenko, K., Darrell, T.: Interactive adaptation of real-time object detectors. In: ICRA, pp. 1282–1289. IEEE (2014)
Goodfellow, I.J., et al.: Generative adversarial nets. In: NeurIPS (2014)
Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: AugMix: a simple data processing method to improve robustness and uncertainty. In: ICLR (2020)
Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: ICML (2018)
Hoyer, L., Dai, D., Wang, H., Van Gool, L.: MIC: masked image consistency for context-enhanced domain adaptation. In: CVPR, pp. 11721–11732 (2023)
Jiang, J., Chen, B., Fu, B., Long, M.: Transfer-learning-library (2020). https://github.com/thuml/Transfer-Learning-Library
Jiang, J., Shu, Y., Wang, J., Long, M.: Transferability in deep learning: a survey. arXiv preprint arXiv:2201.05867 (2022)
Jin, Y., Wang, X., Long, M., Wang, J.: Minimum class confusion for versatile domain adaptation. In: ECCV (2020)
Kang, G., Jiang, L., Yang, Y., Hauptmann, A.: Contrastive adaptation network for unsupervised domain adaptation. In: CVPR, pp. 4888–4897 (2019)
Kim, D., Wang, K., Sclaroff, S., Saenko, K.: A broad study of pre-training for domain generalization and adaptation. In: ECCV (2022)
Kumar, A., Raghunathan, A., Jones, R.M., Ma, T., Liang, P.: Fine-tuning can distort pretrained features and underperform out-of-distribution. In: ICLR (2022)
Lee, C.Y., Batra, T., Baig, M.H., Ulbricht, D.: Sliced Wasserstein discrepancy for unsupervised domain adaptation. In: CVPR, pp. 10277–10287 (2019)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, Z., et al.: Swin Transformer V2: scaling up capacity and resolution. In: CVPR, pp. 11999–12009 (2022)
Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: ICML, pp. 97–105. PMLR (2015)
Long, M., Cao, Z., Wang, J., Jordan, M.I.: Conditional adversarial domain adaptation. In: NeurIPS (2018)
Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML (2017)
Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
Peng, X., Usman, B., Kaushik, N., Wang, D., Hoffman, J., Saenko, K.: VisDA: a synthetic-to-real benchmark for visual domain adaptation. In: CVPR-W, pp. 2021–2026 (2018)
Rangwani, H., Aithal, S.K., Mishra, M., Jain, A., Babu, R.V.: A closer look at smoothness in domain adversarial training. In: ICML (2022)
Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: CVPR, pp. 7464–7473 (2017)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers and distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 10347–10357. PMLR (2021)
Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 2962–2971 (2017)
Woo, S., et al.: ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023)
Xu, T., Chen, W., Wang, P., Wang, F., Li, H., Jin, R.: CDTrans: cross-domain transformer for unsupervised domain adaptation. In: ICLR (2022)
Yang, J., Liu, J., Xu, N., Huang, J.: TVT: transferable vision transformer for unsupervised domain adaptation. In: WACV, pp. 520–530 (2023)
Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)
Acknowledgements
This work was funded by the German Federal Ministry of Education and Research (BMBF) through its support of the project SynthNet, part of the KMU-Innovativ initiative (project code: 01IS21002C), and of the KI-Werkstatt project at the University of Applied Sciences Berlin, part of the Forschung an Fachhochschulen program (project code: 13FH028KI1), as well as by the project TAHAI, funded by IFAF Berlin.
Appendices
A Implementation Details
A.1 Adapting Pretrained Models to Rendered Images: Implementation Details
We use the pretrained models “google/vit-base-patch16-224-in21k” (ViT) [2], “microsoft/swinv2-base-patch4-window12-192-22k” (SwinV2) [17], “facebook/convnextv2-base-22k-224” (ConvNextV2) [27], and “facebook/deit-base-distilled-patch16-224” (DeiT) [25] from Hugging Face for the experiments on the VisDA-2017 dataset, but only ViT and SwinV2 for our Topex-Printer dataset. ViT, SwinV2, and ConvNextV2 were pretrained on ImageNet22K, while DeiT was pretrained on ImageNet1K. We perform three different training schemes: training the classification head only (CH), fine-tuning the full model (FT), and a combination of both, tuning the classification head first and then continuing with full fine-tuning (CH-FT), inspired by [14].
1. For CH, we use the PyTorch SGD optimizer with learning rates [10.0, 0.1, 0.001], momentum 0.9, no weight decay, no learning rate scheduler, and no warmup.
2. For FT, we use the PyTorch implementation of the AdamW optimizer with learning rates [0.1, 0.001, 0.00001], weight decay 0.01, a cosine annealing learning rate scheduler [21] without restarts, and two warmup epochs (10% of total epochs).
For data augmentation on both datasets, the PyTorch 2.0.0 implementations are used. A minimal sketch of the two optimizer setups listed above is given below.
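For concreteness, the CH and FT schemes could be assembled along the following lines. This is a minimal PyTorch sketch, not the released training code: the Hugging Face loading call, the `classifier` head attribute, and per-epoch scheduler stepping are illustrative assumptions.

```python
from torch.optim import SGD, AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from transformers import AutoModelForImageClassification

# Load a pretrained backbone from Hugging Face (12 classes for VisDA-2017).
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=12
)

def build_ch_optimizer(model, lr):
    """CH scheme: freeze the backbone, train the classification head only."""
    for p in model.parameters():
        p.requires_grad = False
    # `classifier` is the head attribute of Hugging Face image-classification
    # models; treat this name as an assumption for other architectures.
    head_params = list(model.classifier.parameters())
    for p in head_params:
        p.requires_grad = True
    # SGD with momentum 0.9, no weight decay, no scheduler, no warmup.
    return SGD(head_params, lr=lr, momentum=0.9, weight_decay=0.0)

def build_ft_optimizer(model, lr, total_epochs, warmup_epochs=2):
    """FT scheme: full fine-tuning, AdamW, linear warmup + cosine annealing."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    # Scheduler is stepped once per epoch in this sketch.
    scheduler = SequentialLR(
        optimizer,
        schedulers=[
            LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),
            CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
        ],
        milestones=[warmup_epochs],
    )
    return optimizer, scheduler
```

Under this setup, CH-FT simply trains with `build_ch_optimizer` first and then continues full fine-tuning from the resulting checkpoint with `build_ft_optimizer`.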
A.2 Adapting to Real-World Images with Unsupervised Domain Adaptation: Implementation Details
For the UDA experiments, we start from the best source-domain-only trained CH checkpoint for each model architecture and continue training with the same parameters as the best FT run for that model, as described in the paper. We use the PyTorch 2.0.0 implementations of the image augmentations random resized crop, horizontal flip, and AugMix [6], with the same parameters described in the last paragraph of Sect. A.1. We use the Transfer Learning Library (tllib) [9, 10] implementations of the CDAN (hidden size 1024) and MCC [11] (temperature 1.0) domain adaptation methods, and also combine both. For each model architecture we use two different initial checkpoints: one from Hugging Face, pretrained on ImageNet22K [1] (“google/vit-base-patch16-224-in21k” (ViT) and “microsoft/swinv2-base-patch4-window12-192-22k” (SwinV2)), and the best-performing checkpoint after training only the classification head from our source-domain-only experiments. Again, we use the global random seed 42 for all experiments, and training is performed on a single Nvidia Tesla V100 PCIE 32GB GPU.
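To make the MCC objective concrete, the following is a compact plain-PyTorch sketch of the loss from [11], with the temperature set to 1.0 as in our runs. Our experiments use the tllib implementation; this re-implementation is for illustration only, and details such as the exact uncertainty reweighting should be checked against that library.

```python
import torch
import torch.nn.functional as F

def mcc_loss(target_logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Minimum Class Confusion loss [11] on a batch of unlabeled target logits.

    target_logits: tensor of shape (batch_size, num_classes).
    """
    b, c = target_logits.shape
    probs = F.softmax(target_logits / temperature, dim=1)    # (B, C)
    # Entropy-based sample weights: confident predictions count more.
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # (B,)
    weights = 1.0 + torch.exp(-entropy)
    weights = b * weights / weights.sum()                    # rescale to sum to B
    # Weighted class-confusion matrix, normalized per row.
    confusion = probs.t() @ (weights.unsqueeze(1) * probs)   # (C, C)
    confusion = confusion / confusion.sum(dim=1, keepdim=True)
    # Minimize the off-diagonal (between-class confusion) mass.
    return (confusion.sum() - confusion.trace()) / c
```

During adaptation, this term is computed on each unlabeled target batch and added to the supervised classification loss on source images (and, in the combined CDAN+MCC setting, to the conditional adversarial alignment loss).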
In contrast to other methods, our model performs considerably better at correctly identifying the truck class but underperforms on the motorcycle and person classes. The confusion matrix in Fig. 6 shows that our trained model often confuses motorcycle samples with bicycles (7%) and skateboards (10%), while the person class is confused rather uniformly (3%–4%) with skateboards, plants, motorcycles, and horses.
B Dataset Samples
C Evaluation Results
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ritter, D., Hemberger, M., Hönig, M., Stopp, V., Rodner, E., Hildebrand, K. (2025). CAD Models to Real-World Images: A Practical Approach to Unsupervised Domain Adaptation in Industrial Object Classification. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_33
DOI: https://doi.org/10.1007/978-3-031-74640-6_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-74639-0
Online ISBN: 978-3-031-74640-6