TY - GEN
T1 - Fine-tuning the Diffusion Model and Distilling Informative Priors for Sparse-view 3D Reconstruction
AU - Tang, Jiadong
AU - Gao, Yu
AU - Jiang, Tianji
AU - Yang, Yi
AU - Fu, Mengyin
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - 3D reconstruction methods such as Neural Radiance Fields (NeRFs) are capable of optimizing high-quality 3D representations from images. However, NeRF is limited by its requirement for a large number of multi-view images, making its application to real-world scenarios challenging. In this work, we propose a method that can reconstruct real-world scenes from a few input images and a simple text prompt. Specifically, we fine-tune a pretrained diffusion model to constrain its powerful priors to the visual inputs and generate 3D-aware images, leveraging the coarse renderings obtained from the input images as the image condition, along with the text prompt as the text condition. Our fine-tuning method saves a significant amount of training time and GPU memory while still generating credible results. Moreover, to give our method self-evaluation capability, we design a semantic switch that filters out generated images that do not match the real scene, ensuring that only informative priors from the fine-tuned diffusion model are distilled into the 3D model. The semantic switch can be used as a plug-in and improves performance by 13%. We evaluate our approach on a real-world dataset and demonstrate competitive results compared with existing sparse-view 3D reconstruction methods. Please see our project page for more visualizations and code: https://bityia.github.io/FDfusion.
AB - 3D reconstruction methods such as Neural Radiance Fields (NeRFs) are capable of optimizing high-quality 3D representations from images. However, NeRF is limited by its requirement for a large number of multi-view images, making its application to real-world scenarios challenging. In this work, we propose a method that can reconstruct real-world scenes from a few input images and a simple text prompt. Specifically, we fine-tune a pretrained diffusion model to constrain its powerful priors to the visual inputs and generate 3D-aware images, leveraging the coarse renderings obtained from the input images as the image condition, along with the text prompt as the text condition. Our fine-tuning method saves a significant amount of training time and GPU memory while still generating credible results. Moreover, to give our method self-evaluation capability, we design a semantic switch that filters out generated images that do not match the real scene, ensuring that only informative priors from the fine-tuned diffusion model are distilled into the 3D model. The semantic switch can be used as a plug-in and improves performance by 13%. We evaluate our approach on a real-world dataset and demonstrate competitive results compared with existing sparse-view 3D reconstruction methods. Please see our project page for more visualizations and code: https://bityia.github.io/FDfusion.
UR - http://www.scopus.com/inward/record.url?scp=85216455789&partnerID=8YFLogxK
U2 - 10.1109/IROS58592.2024.10802155
DO - 10.1109/IROS58592.2024.10802155
M3 - Conference contribution
AN - SCOPUS:85216455789
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 7437
EP - 7444
BT - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
Y2 - 14 October 2024 through 18 October 2024
ER -