
CAD Models to Real-World Images: A Practical Approach to Unsupervised Domain Adaptation in Industrial Object Classification

Conference paper, published in: Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2023)

Abstract

In this paper, we systematically analyze unsupervised domain adaptation pipelines for object classification in a challenging industrial setting. In contrast to standard natural object benchmarks existing in the field, our results highlight the most important design choices when only category-labeled CAD models are available but classification needs to be done with real-world images. Our domain adaptation pipeline achieves SoTA performance on the VisDA benchmark, but more importantly, drastically improves recognition performance on our new open industrial dataset comprised of 102 mechanical parts. We conclude with a set of guidelines that are relevant for practitioners needing to apply state-of-the-art unsupervised domain adaptation in practice. Our code is available at https://github.com/dritter-bht/synthnet-transfer-learning.


Notes

  1. https://huggingface.co/models.

  2. https://pytorch.org/.

  3. https://huggingface.co/docs/transformers/main_classes/optimizer_schedules.

  4. https://pytorch.org/vision/main/generated/torchvision.transforms.AugMix.html.

References

  1. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)

  2. Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. In: ICLR (2021)

  3. Ganin, Y., et al.: Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17(1), 2096–2030 (2016)

  4. Goehring, D., Hoffman, J., Rodner, E., Saenko, K., Darrell, T.: Interactive adaptation of real-time object detectors. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1282–1289. IEEE (2014)

  5. Goodfellow, I.J., et al.: Generative adversarial nets. In: NeurIPS (2014)

  6. Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Lakshminarayanan, B.: AugMix: a simple data processing method to improve robustness and uncertainty. In: ICLR (2019)

  7. Hoffman, J., et al.: CyCADA: cycle-consistent adversarial domain adaptation. In: ICML (2017)

  8. Hoyer, L., Dai, D., Wang, H., Van Gool, L.: MIC: masked image consistency for context-enhanced domain adaptation. In: CVPR, pp. 11721–11732 (2023)

  9. Jiang, J., Chen, B., Fu, B., Long, M.: Transfer-Learning-Library (2020). https://github.com/thuml/Transfer-Learning-Library

  10. Jiang, J., Shu, Y., Wang, J., Long, M.: Transferability in deep learning: a survey. arXiv preprint arXiv:2201.05867 (2022)

  11. Jin, Y., Wang, X., Long, M., Wang, J.: Minimum class confusion for versatile domain adaptation. In: ECCV (2019)

  12. Kang, G., Jiang, L., Yang, Y., Hauptmann, A.: Contrastive adaptation network for unsupervised domain adaptation. In: CVPR, pp. 4888–4897 (2019)

  13. Kim, D., Wang, K., Sclaroff, S., Saenko, K.: A broad study of pre-training for domain generalization and adaptation. In: ECCV (2022)

  14. Kumar, A., Raghunathan, A., Jones, R.M., Ma, T., Liang, P.: Fine-tuning can distort pretrained features and underperform out-of-distribution. In: ICLR (2022)

  15. Lee, C.Y., Batra, T., Baig, M.H., Ulbricht, D.: Sliced Wasserstein discrepancy for unsupervised domain adaptation. In: CVPR, pp. 10277–10287 (2019)

  16. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

  17. Liu, Z., et al.: Swin Transformer V2: scaling up capacity and resolution. In: CVPR, pp. 11999–12009 (2021)

  18. Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: ICML, pp. 97–105. PMLR (2015)

  19. Long, M., Cao, Z., Wang, J., Jordan, M.I.: Conditional adversarial domain adaptation. In: NeurIPS (2017)

  20. Long, M., Zhu, H., Wang, J., Jordan, M.I.: Deep transfer learning with joint adaptation networks. In: ICML (2016)

  21. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)

  22. Peng, X., Usman, B., Kaushik, N., Wang, D., Hoffman, J., Saenko, K.: VisDA: a synthetic-to-real benchmark for visual domain adaptation. In: CVPR-W, pp. 2021–2026 (2018)

  23. Rangwani, H., Aithal, S.K., Mishra, M., Jain, A., Babu, R.V.: A closer look at smoothness in domain adversarial training. In: ICML (2022)

  24. Real, E., Shlens, J., Mazzocchi, S., Pan, X., Vanhoucke, V.: YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In: CVPR, pp. 7464–7473 (2017)

  25. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers and distillation through attention. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, vol. 139, pp. 10347–10357. PMLR (2021)

  26. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: CVPR, pp. 2962–2971 (2017)

  27. Woo, S., et al.: ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023)

  28. Xu, T., Chen, W., Pichao, W., Wang, F., Li, H., Jin, R.: CDTrans: cross-domain transformer for unsupervised domain adaptation. In: ICLR (2021)

  29. Yang, J., Liu, J., Xu, N., Huang, J.: TVT: transferable vision transformer for unsupervised domain adaptation. In: WACV, pp. 520–530 (2021)

  30. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV (2017)


Acknowledgements

This work was funded by the German Federal Ministry of Education and Research (BMBF) through its support of the project SynthNet, part of the KMU-Innovativ initiative (project code: 01IS21002C), the KI-Werkstatt project at the University of Applied Sciences Berlin, part of the Forschung an Fachhochschulen program (project code: 13FH028KI1), as well as the project TAHAI (funded by IFAF Berlin).

Author information

Correspondence to Dennis Ritter.

Appendices

A Implementation Details

A.1 Adapting Pretrained Models to Rendered Images: Implementation Details

We use the pretrained models "google/vit-base-patch16-224-in21k" (ViT) [2], "microsoft/swinv2-base-patch4-window12-192-22k" (SwinV2) [17], "facebook/convnextv2-base-22k-224" (ConvNextV2) [27], and "facebook/deit-base-distilled-patch16-224" (DeiT) [25] from Huggingface (Footnote 1) for experiments on the VisDA-2017 dataset, but only ViT and SwinV2 for our Topex-Printer dataset. ViT, SwinV2, and ConvNextV2 were pretrained on ImageNet22K, while DeiT was pretrained on ImageNet1K. We use three different training schemes: training the classification head only (CH), fine-tuning the full model (FT), and a combination of both, tuning the classification head first and then continuing with full fine-tuning (CH-FT), inspired by [14].
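
The three schemes differ only in which parameters receive gradients. The following minimal sketch shows how such checkpoints can be loaded with the Huggingface transformers library and frozen accordingly; the helper prepare_model and the assumption that the classification head is named classifier are illustrative and not taken from the released code.

```python
# Sketch only: loading a pretrained checkpoint and selecting a training
# scheme. `prepare_model` is a hypothetical helper; the head attribute is
# assumed to be called `classifier`, which holds for the listed checkpoints
# except possibly the distilled DeiT variant.
from transformers import AutoModelForImageClassification

CHECKPOINTS = {
    "vit": "google/vit-base-patch16-224-in21k",
    "swinv2": "microsoft/swinv2-base-patch4-window12-192-22k",
    "convnextv2": "facebook/convnextv2-base-22k-224",
    "deit": "facebook/deit-base-distilled-patch16-224",
}

def prepare_model(name: str, num_classes: int, scheme: str = "FT"):
    """scheme: "CH" (head only), "FT" (full), or "CH-FT" (CH first, then FT)."""
    model = AutoModelForImageClassification.from_pretrained(
        CHECKPOINTS[name],
        num_labels=num_classes,
        ignore_mismatched_sizes=True,  # replace the pretrained head for the new classes
    )
    if scheme in ("CH", "CH-FT"):
        # Freeze the backbone; only the freshly initialized head is trained.
        for p in model.parameters():
            p.requires_grad = False
        for p in model.classifier.parameters():
            p.requires_grad = True
    return model

# For CH-FT, all parameters are unfrozen again after the head-only stage:
# for p in model.parameters(): p.requires_grad = True
```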

  1. For CH we use the PyTorch (Footnote 2) SGD optimizer with learning rates [10.0, 0.1, 0.001], momentum 0.9, no weight decay, no learning rate scheduler, and no warmup.

  2. For FT we use the PyTorch implementation of the AdamW optimizer with learning rates [0.1, 0.001, 0.00001], weight decay 0.01, a cosine annealing learning rate scheduler (Footnote 3) [21] without restarts, and two warmup epochs (10% of total epochs); see the sketch after this list.
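
As an illustration, the optimizer and warmup schedule for these two schemes can be configured as in the sketch below, using the cosine schedule with warmup from the transformers library (Footnote 3); build_optimizer, the step counts, and the single learning rate argument are illustrative placeholders.

```python
# Sketch only: optimizer and scheduler setup for the CH and FT schemes.
# `build_optimizer`, `steps_per_epoch`, and `epochs` are illustrative names.
import torch
from transformers import get_cosine_schedule_with_warmup

def build_optimizer(model, scheme: str, lr: float, steps_per_epoch: int, epochs: int):
    params = [p for p in model.parameters() if p.requires_grad]
    if scheme == "CH":
        # SGD, momentum 0.9, no weight decay, no scheduler, no warmup.
        optimizer = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=0.0)
        scheduler = None
    else:  # "FT"
        # AdamW, weight decay 0.01, cosine annealing without restarts,
        # warmup over the first 10% of the total training steps.
        optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
        total_steps = steps_per_epoch * epochs
        scheduler = get_cosine_schedule_with_warmup(
            optimizer,
            num_warmup_steps=int(0.1 * total_steps),
            num_training_steps=total_steps,
        )
    return optimizer, scheduler
```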

For both datasets, the PyTorch 2.0.0 implementation of the data augmentations (Footnote 4) is used.
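
A possible torchvision composition of the augmentations used throughout (random resized crop, horizontal flip, AugMix; cf. Sect. A.2) is sketched below; the crop size and normalization statistics are assumptions, not values reported in this paper.

```python
# Sketch only: source-domain training augmentations (random resized crop,
# horizontal flip, AugMix). Crop size and normalization statistics are
# assumptions, not values reported in the paper.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.AugMix(),  # expects a PIL image or a uint8 tensor
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```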

A.2 Adapting to Real-World Images with Unsupervised Domain Adaptation: Implementation Details

For the UDA experiments we start from the best source-domain-only trained CH checkpoint for the respective model architecture and continue training with the same parameters as the best FT run for each model, as described in the paper. We use PyTorch 2.0.0 implementations of the image augmentations random resized crop, horizontal flip, and AugMix [6] with the same parameters as described in the last paragraph of Sect. A.1. We use the Transfer Learning Library (tllib) [9, 10] implementations of the CDAN [19] (hidden size 1024) and MCC [11] (temperature 1.0) domain adaptation methods, and also combine both. For each model architecture we use two different initial checkpoints: one from Huggingface, pretrained on ImageNet22K [1] ("google/vit-base-patch16-224-in21k" (ViT) and "microsoft/swinv2-base-patch4-window12-192-22k" (SwinV2)), and the best-performing checkpoint after training only the classification head in our source-domain-only experiments. Again, we use the global random seed 42 for all experiments, and training is performed on a single Nvidia Tesla V100 PCIE 32GB GPU.
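
For reference, the MCC objective [11] can be written in a few lines of plain PyTorch. The sketch below follows the formulation of the original paper (temperature-scaled softmax, entropy-based sample weighting, row-normalized class-confusion matrix); it is illustrative only, since the experiments use the tllib implementation.

```python
# Sketch only: Minimum Class Confusion (MCC) loss [11] in plain PyTorch,
# following the original formulation; the experiments use the tllib version.
import torch
import torch.nn.functional as F

def mcc_loss(target_logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """target_logits: (batch, num_classes) logits on unlabeled target-domain images."""
    batch_size, num_classes = target_logits.shape
    probs = F.softmax(target_logits / temperature, dim=1)        # (B, C)

    # Entropy-based sample weights: confident (low-entropy) samples count more.
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)      # (B,)
    weights = 1.0 + torch.exp(-entropy)
    weights = batch_size * weights / weights.sum()               # (B,)

    # Weighted class-confusion matrix and per-class (row) normalization.
    confusion = (probs * weights.unsqueeze(1)).t() @ probs        # (C, C)
    confusion = confusion / confusion.sum(dim=1, keepdim=True)

    # Minimize the off-diagonal mass, i.e. confusion between different classes.
    return (confusion.sum() - confusion.trace()) / num_classes
```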

In contrast to other methods, we perform considerably better at correctly identifying the truck class but underperform on the motorcycle and person classes. The confusion matrix in Fig. 6 shows that our trained model often confuses motorcycle samples with bicycles (7%) and skateboards (10%), while the person class is confused rather uniformly (3%–4%) with skateboards, plants, motorcycles, and horses.

B Dataset Samples

(See Figs. 3, 4 and 5).

Fig. 3.

80 random samples of rendered images from the Topex-Printer dataset. Each 512×512 image, featuring machine parts marked with bounding boxes, is trimmed according to these boxes, extended to form a rectangle, and padded with black if needed. Finally, all images are resized to a resolution of 256×256 pixels.
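
One plausible implementation of this crop-pad-resize step is sketched below; the function name and the choice to pad the crop to a square are assumptions, not taken from the released preprocessing code.

```python
# Sketch only: bounding-box crop, black padding, and resize to 256x256 as
# described in the Fig. 3 caption. Names and the square padding are
# illustrative assumptions, not the released preprocessing code.
from PIL import Image

def crop_pad_resize(image: Image.Image, box, out_size: int = 256) -> Image.Image:
    """box = (left, top, right, bottom) bounding box of the machine part."""
    part = image.crop(box)
    w, h = part.size
    side = max(w, h)
    # Paste the crop centered onto a black square so no distortion occurs.
    square = Image.new("RGB", (side, side), (0, 0, 0))
    square.paste(part, ((side - w) // 2, (side - h) // 2))
    return square.resize((out_size, out_size))
```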

Fig. 4.

80 random samples of real images from the Topex-Printer dataset.

Fig. 5.

(Best viewed in color) Left (a): HDRI of the warehouse environment map used in our rendering scene. Image by Sergej Majboroda [CC0], via Polyhaven. Right (b): The handcrafted Blender material collection used for the Topex-Printer dataset.

C Evaluation Results

(See Tables 3, 5 and 8).

Table 2. Acc@1 in % on the target domain (real images) for all source-domain-only training experiments on the VisDA-2017 classification dataset. Note that base transform means that random color jitter and random grayscale transforms are applied. Faded-out rows represent numerically unstable runs that were canceled, e.g., due to NaN loss.
Table 3. Acc@1 in % on the target domain (real images) for all source-domain-only training experiments on the Topex-Printer dataset. Note that base transform means that random color jitter and random grayscale transforms are applied. Faded-out rows represent numerically unstable runs that were canceled, e.g., due to NaN loss.
Table 4. Acc@1 in % on target domain (real images) for best results per model and training scheme in our source domain training experiments on VisDA-2017 classification dataset. Note that base transform means that random color jitter and random grayscale transforms are applied instead of AugMix (other augmentations stay the same as explained in Sect. A.1).
Table 5. Acc@1 in % on target domain (real images) for best results per model and training scheme in our source-domain-only training experiments on Topex-Printer dataset. Note that base transform means that random color jitter and random grayscale transforms are applied instead of AugMix (other augmentations stay the same as explained in Sect. A.1).
Table 6. Acc@1 in % on the target domain (real images) for all UDA experiments on the VisDA-2017 classification dataset. Note that init checkpoint describes the model checkpoint used for the UDA experiments. CH refers to the best-performing CH training scheme from our DG experiments for the respective model architecture, and IN22K refers to the respective Huggingface model checkpoints described in Sect. A.2.
Table 7. Acc@1 in % on the target domain (real images) for all UDA experiments on the Topex-Printer dataset. Note that init checkpoint describes the model checkpoint used for the UDA experiments. CH refers to the best-performing CH training scheme from our source-domain-only training experiments for the respective model architecture, and IN22K refers to the respective Huggingface model checkpoints described in Sect. A.2.
Fig. 6.

Confusion matrix for our best-performing model on VisDA-2017: SwinV2-CH-CDAN-MCC

Table 8. Image classification top-1 accuracy in % on VisDA-2017 target domain (real images) across all classes compared to literature. We report our best source-domain-only and UDA runs for the ViT and SwinV2 architecture.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Ritter, D., Hemberger, M., Hönig, M., Stopp, V., Rodner, E., Hildebrand, K. (2025). CAD Models to Real-World Images: A Practical Approach to Unsupervised Domain Adaptation in Industrial Object Classification. In: Meo, R., Silvestri, F. (eds) Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2023. Communications in Computer and Information Science, vol 2136. Springer, Cham. https://doi.org/10.1007/978-3-031-74640-6_33

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-74640-6_33


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-74639-0

  • Online ISBN: 978-3-031-74640-6

  • eBook Packages: Artificial Intelligence (R0)
