Abstract
We introduce MaskEditor, an object-level 3D neural field editing method driven by text instructions. Unlike whole-scene manipulation, local editing requires accurate localization of the target and proper field fusion to produce a realistic object-level replacement. We learn a 3D mask grid that precisely localizes the target object by leveraging 2D segmentation masks from the Segment Anything Model (SAM), and use the learned mask to divide the scene into an object field and a background field. We then apply Variational Score Distillation (VSD) to optimize the object field while leaving the background field unaltered, producing edits aligned with the text instruction. Furthermore, we adopt composited rendering and a coarse-to-fine editing strategy to enhance editing quality and the consistency of the edited object with the original scene. Qualitative and quantitative evaluations confirm that MaskEditor achieves more precise local editing than existing baselines.
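To make the pipeline concrete, here is a minimal PyTorch sketch of the mask-grid idea described above: 2D SAM masks supervise a learnable 3D logit grid through the radiance field's volume-rendering weights, and the rendered 2D mask then composites the edited object field over the unaltered background. All names (MaskGrid, render_mask, composite), shapes, and resolutions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


class MaskGrid(torch.nn.Module):
    """Dense logit grid over the scene bounding box; sigmoid gives a soft 3D mask."""

    def __init__(self, resolution: int = 128):
        super().__init__()
        # One scalar logit per voxel, stored as (N, C, D, H, W) for grid_sample.
        self.logits = torch.nn.Parameter(
            torch.zeros(1, 1, resolution, resolution, resolution)
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) query points normalized to [-1, 1]^3.
        coords = xyz.view(1, -1, 1, 1, 3)
        logits = F.grid_sample(self.logits, coords, align_corners=True)
        return torch.sigmoid(logits.view(-1))  # per-point mask probability


def render_mask(mask_grid: MaskGrid, pts: torch.Tensor,
                weights: torch.Tensor) -> torch.Tensor:
    """Volume-render per-point mask values into a soft per-ray 2D mask.

    pts:     (R, S, 3) sample points along R rays, S samples each
    weights: (R, S) rendering weights from the reconstructed radiance field
    """
    m = mask_grid(pts.reshape(-1, 3)).view(pts.shape[:2])
    return (weights * m).sum(dim=-1)  # (R,) soft 2D mask


def mask_loss(rendered: torch.Tensor, sam_mask: torch.Tensor) -> torch.Tensor:
    """Supervise the rendered mask with SAM's 2D segmentation of the target."""
    return F.binary_cross_entropy(rendered.clamp(1e-5, 1 - 1e-5), sam_mask)


def composite(rgb_obj: torch.Tensor, rgb_bg: torch.Tensor,
              mask_2d: torch.Tensor) -> torch.Tensor:
    """Composited rendering: blend the edited object over the frozen background."""
    return mask_2d[..., None] * rgb_obj + (1.0 - mask_2d[..., None]) * rgb_bg
```

Because the mask is rendered with the scene's own sample weights, gradients from the 2D SAM supervision flow only into the voxel logits, so multi-view segmentations can plausibly be fused into one view-consistent 3D mask without touching the reconstructed field.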
References
Bao, C., et al.: SINE: semantic-driven image-based NeRF editing with prior-guided editing field. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20919–20929 (2023)
Brooks, T., Holynski, A., Efros, A.A.: InstructPix2Pix: learning to follow image editing instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18392–18402 (2023)
Cen, J., et al.: Segment anything in 3D with NeRFs. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Chen, J.K., Lyu, J., Wang, Y.X.: NeuralEditor: editing neural radiance fields via manipulating point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12439–12448 (2023)
Chen, Y., et al.: GaussianEditor: swift and controllable 3D editing with Gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21476–21485 (2024)
Gordon, O., Avrahami, O., Lischinski, D.: Blended-NeRF: zero-shot object generation and blending in existing neural radiance fields (2023). arXiv:2306.12760
Haque, A., Tancik, M., Efros, A., Holynski, A., Kanazawa, A.: Instruct-NeRF2NeRF: editing 3D scenes with instructions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
Kirillov, A., et al.: Segment anything (2023). arXiv:2304.02643
Lin, C.H., et al.: Magic3D: high-resolution text-to-3D content creation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 300–309 (2023)
Liu, K., et al.: StyleRF: zero-shot 3D style transfer of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8338–8348 (2023)
Liu, S., Zhang, X., Zhang, Z., Zhang, R., Zhu, J.Y., Russell, B.: Editing conditional radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5773–5783 (2021)
Mikaeili, A., Perel, O., Safaee, M., Cohen-Or, D., Mahdavi-Amiri, A.: SKED: sketch-guided text-based 3D editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14607–14619 (2023)
Mildenhall, B., Srinivasan, P.P., et al.: Local light field fusion: practical view synthesis with prescriptive sampling guidelines. ACM Trans. Graph. (TOG) 38(4), 1–14 (2019)
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. In: European Conference on Computer Vision, pp. 405–421 (2020)
Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: text-to-3D using 2D diffusion. In: International Conference on Learning Representations (ICLR) (2023)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021)
Ren, T., et al.: Grounded SAM: assembling open-world models for diverse visual tasks (2024). arXiv:2401.14159
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104–4113 (2016)
Sella, E., Fiebelman, G., Hedman, P., Averbuch-Elor, H.: Vox-E: text-guided voxel editing of 3D objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 430–440 (2023)
Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5459–5469 (2022)
Wang, C., Chai, M., He, M., Chen, D., Liao, J.: CLIP-NeRF: text-and-image driven manipulation of neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3835–3844 (2022)
Wang, J., Fang, J., Zhang, X., Xie, L., Tian, Q.: GaussianEditor: editing 3D Gaussians delicately with text instructions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20902–20911 (2024)
Wang, Q., et al.: IBRNet: learning multi-view image-based rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4690–4699 (2021)
Wang, Z., et al.: ProlificDreamer: high-fidelity and diverse text-to-3D generation with variational score distillation. In: Advances in Neural Information Processing Systems (NeurIPS) (2023)
Yu, L., Xiang, W., Han, K.: Edit-DiffNeRF: editing 3D neural radiance fields using 2D diffusion model (2023). arXiv:2306.09551
Zhang, K., et al.: ARF: artistic radiance fields. In: European Conference on Computer Vision, pp. 717–733 (2022)
Zhi, S., Laidlow, T., Leutenegger, S., Davison, A.J.: In-place scene labelling and understanding with implicit scene representation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15838–15847 (2021)
Acknowledgements
This work was supported in part by the NSFC (62372457, 62132021, 62325211), the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001), and the Natural Science Foundation of Hunan Province of China (2021RC3071, 2022RC1104).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Liu, X., Xu, K., Huang, Y., Yi, R., Zhu, C. (2025). MaskEditor: Instruct 3D Object Editing with Learned Masks. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15036. Springer, Singapore. https://doi.org/10.1007/978-981-97-8508-7_20
DOI: https://doi.org/10.1007/978-981-97-8508-7_20
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8507-0
Online ISBN: 978-981-97-8508-7