Abstract
The Multi-mOdality Cut and pAste (MoCa) method cuts object samples from other frames and pastes them onto the current training frame to increase the number of training objects. However, because all of MoCa's samples come from the original dataset, its ability to increase object diversity is limited. Recently, diffusion models have achieved remarkable results in image generation, where a simple prompt can guide the model to produce entirely different images. In this paper, we propose DiffMoCa, which leverages the creative power of diffusion models to redraw the object patches cut by MoCa, thereby increasing object diversity and enhancing the model's generalization ability. DiffMoCa demonstrates its effectiveness in extensive experiments, surpassing MoCa by 2.2% mAP on the KITTI dataset under the moderate difficulty setting.
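The abstract describes the pipeline only at a high level. The following minimal sketch illustrates the core idea, assuming an off-the-shelf Stable Diffusion image-to-image pipeline from the diffusers library stands in for the paper's actual generator; the function name, prompt, and strength value are illustrative assumptions, not taken from the paper.

# Hypothetical sketch (not the authors' code): cut an object patch from a source
# frame, redraw it with a Stable Diffusion img2img pipeline, paste into the target frame.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # stand-in model; the paper's generator may differ
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

def redraw_and_paste(src_frame: Image.Image, dst_frame: Image.Image,
                     box: tuple, prompt: str = "a photo of a car, realistic",
                     strength: float = 0.4) -> Image.Image:
    """Cut the object in `box` from `src_frame`, redraw it, paste it into `dst_frame`."""
    patch = src_frame.crop(box)                 # cut (MoCa step)
    w, h = patch.size
    patch_in = patch.resize((512, 512))         # SD 1.5 operates at 512x512
    # A low `strength` preserves the object's pose and geometry while varying its
    # appearance, so the 2D box and associated LiDAR points stay approximately valid.
    redrawn = pipe(prompt=prompt, image=patch_in, strength=strength).images[0]
    redrawn = redrawn.resize((w, h))
    out = dst_frame.copy()
    out.paste(redrawn, box[:2])                 # paste at the same location
    return out

In this sketch the diffusion step only re-renders the cropped patch before the usual cut-and-paste, which is one plausible way to add appearance diversity without disturbing the annotation geometry.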
Acknowledgement
The work was supported by the National Key R&D Program of China under Grant 2021ZD0201300, the National Natural Science Foundation of China under Grants U1913602 and 61936004, the Innovation Group Project of the National Natural Science Foundation of China under Grant 61821003, and the 111 Project on Computational Intelligence and Intelligent Control under Grant B18024.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhang, J., Wu, S., Gao, J., Yu, F., Xu, H., Zeng, Z. (2024). DiffMoCa: Diffusion Model Based Multi-modality Cut and Paste. In: Le, X., Zhang, Z. (eds) Advances in Neural Networks – ISNN 2024. ISNN 2024. Lecture Notes in Computer Science, vol 14827. Springer, Singapore. https://doi.org/10.1007/978-981-97-4399-5_15
DOI: https://doi.org/10.1007/978-981-97-4399-5_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-4398-8
Online ISBN: 978-981-97-4399-5