Abstract:
Recently, diffusion models have achieved remarkable success in blind image super-resolution. However, most existing methods rely solely on uni-modal degraded low-resolution images to guide diffusion models in restoring high-fidelity images, resulting in inferior realism. In this letter, we propose a Multi-modal Prior-Guided diffusion model for blind image Super-Resolution (MPGSR), which fine-tunes Stable Diffusion (SD) by utilizing superior visual-and-textual guidance to restore realistic high-resolution images. Specifically, our MPGSR involves two stages, i.e., multi-modal guidance extraction and adaptive guidance injection. For the former, we propose a composited transformer and further combine it with GPT-CLIP to extract representative visual-and-textual guidance. For the latter, we design a feature calibration ControlNet to inject the visual guidance and employ the cross-attention layers provided by the frozen SD to inject the textual guidance, thus effectively activating the powerful text-to-image generation potential of SD. Extensive experiments show that our MPGSR outperforms state-of-the-art methods in restoration quality and convergence time.
Published in: IEEE Signal Processing Letters (Volume: 32)
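To make the adaptive guidance injection concrete, below is a minimal PyTorch sketch of the general mechanism the abstract describes: a trainable, zero-initialized ControlNet-style branch adds calibrated visual residuals to a frozen denoiser block, while textual embeddings enter through the frozen cross-attention layer. All module names, shapes, and initialization choices here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of visual + textual guidance injection (assumed design,
# not the MPGSR source code): visual features are added as residuals via a
# trainable branch; text embeddings pass through frozen cross-attention.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Frozen SD-style cross-attention: image tokens attend to text tokens."""
    def __init__(self, dim, text_dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                          vdim=text_dim, batch_first=True)

    def forward(self, x, text):  # x: (B, N, dim), text: (B, T, text_dim)
        out, _ = self.attn(x, text, text)
        return x + out           # residual injection of textual guidance

class ControlBranch(nn.Module):
    """Trainable ControlNet-like branch: calibrates visual features into
    residuals summed onto the frozen denoiser activations."""
    def __init__(self, dim):
        super().__init__()
        self.calib = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(),
                                   nn.Linear(dim, dim))
        # Zero-init the last layer so training starts from the unmodified
        # frozen SD output, as in the original ControlNet recipe.
        nn.init.zeros_(self.calib[-1].weight)
        nn.init.zeros_(self.calib[-1].bias)

    def forward(self, visual):   # visual: (B, N, dim)
        return self.calib(visual)

class GuidedBlock(nn.Module):
    def __init__(self, dim=320, text_dim=768):
        super().__init__()
        self.cross_attn = CrossAttention(dim, text_dim)
        for p in self.cross_attn.parameters():  # frozen SD weights
            p.requires_grad_(False)
        self.control = ControlBranch(dim)       # only this branch is trained

    def forward(self, x, visual, text):
        x = x + self.control(visual)            # inject visual guidance
        return self.cross_attn(x, text)         # inject textual guidance

block = GuidedBlock()
x = torch.randn(2, 64, 320)          # denoiser activations (toy shapes)
visual = torch.randn(2, 64, 320)     # visual guidance features from stage 1
text = torch.randn(2, 77, 768)       # CLIP-style text embeddings from stage 1
print(block(x, visual, text).shape)  # torch.Size([2, 64, 320])
```

The zero-initialized residual branch is what lets such a scheme fine-tune guidance injection without disturbing the frozen SD backbone at the start of training.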