RePaint: Inpainting using denoising diffusion probabilistic models

Introduction

Image inpainting is the task of filling in the missing region of an image so that the result is semantically consistent with the surrounding content.

According to the paper, existing state-of-the-art inpainting methods are based on GANs or autoregressive models and are trained for a specific mask distribution, so they generalize poorly to mask types not seen during training.

The authors also argue that training with pixel-wise and perceptual losses tends to produce simple texture extensions in the missing region, which makes semantically meaningful generation difficult.

To overcome these limitations, the authors propose RePaint, which uses a pretrained unconditional DDPM as a generative prior.

RePaint does not train a model for the inpainting task itself. Instead, it conditions the generation process by sampling the unmasked region from the given image information at each reverse diffusion iteration.

Figure 1. We use Denoising Diffusion Probabilistic Models (DDPM) for inpainting. The process is conditioned on the masked input (left). It starts from a random Gaussian noise sample that is iteratively denoised until it produces a high-quality output. Since this process is stochastic, we can sample multiple diverse outputs. The DDPM prior forces a harmonized image, is able to reproduce texture from other regions, and inpaint semantically meaningful content.

Figure 1 shows how RePaint works. It takes a masked image as input and produces an inpainted image through an iterative denoising process that starts from random Gaussian noise. Because the sampling process is stochastic, the paper notes that diverse outputs can be generated for the same input.

Contribution

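The main contributions emphasized in the paper can be summarized as follows.

The paper proposes a mask-agnostic method that performs inpainting with a pretrained unconditional DDPM used as is.
Because the network is not modified and no mask-conditional training is performed, the authors claim that the method generalizes to arbitrary mask shapes at inference time.
The paper proposes a resampling technique to address the problem that the standard DDPM sampling strategy produces semantically incorrect results.
The paper reports better performance than GAN-based and autoregressive state-of-the-art methods on CelebA-HQ and ImageNet across a variety of mask settings.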

Preliminaries: DDPM

To set up the method, the paper first reviews the basic formulation of DDPM.

In DDPM, the forward process transforms an image $x_0$ into white Gaussian noise $x_T \sim \mathcal{N}(0, I)$ over $T$ steps. A single forward step is defined as follows.

\[q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t \mathbf{I})\]

In this equation, $\beta_t$ is the variance of the noise added at timestep $t$, and $\sqrt{1-\beta_t}$ is the factor that scales the previous sample.
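As a minimal sketch of the single forward step above, the following NumPy snippet draws $x_t$ from $x_{t-1}$. The function name `forward_step`, the toy 4×4 input, and the value `beta_t=0.02` are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    # Scale the previous sample, then add noise with variance beta_t.
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x_prev = np.ones((4, 4))  # toy stand-in for an image (assumption)
x_t = forward_step(x_prev, beta_t=0.02, rng=rng)
```

With `beta_t = 0` the step returns its input unchanged, which is a quick sanity check that the scaling and noise terms are wired correctly.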

The reverse process is modeled by a neural network and is written as follows.

\[p_{\theta}(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))\]

Because the forward process is a Markov chain of Gaussian steps, the sample at an arbitrary timestep $t$ can be computed from $x_0$ in a single step using the cumulative product $\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)$.

\[q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) \mathbf{I})\]

The paper explains that this equation is used to sample the known region at an arbitrary timestep $t$.
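The closed form can be checked numerically. The sketch below compares a direct draw from $q(x_t \mid x_0)$ with the result of iterating single forward steps. The linear schedule from `1e-4` to `0.02`, the value `T = 250`, the constant toy signal, and the name `q_sample` are assumptions made for illustration only.

```python
import numpy as np

T = 250
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule; betas[t-1] is beta_t
alpha_bars = np.cumprod(1.0 - betas)    # alpha_bars[t-1] is \bar{alpha}_t

def q_sample(x0, t, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) for t in 1..T."""
    abar = alpha_bars[t - 1]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise

rng = np.random.default_rng(0)
x0 = np.full(20000, 2.0)                # constant toy signal (assumption)
t = 100

# One-shot sample from the closed form.
direct = q_sample(x0, t, rng)

# The same timestep reached by iterating t single forward steps.
iterated = x0.copy()
for s in range(1, t + 1):
    b = betas[s - 1]
    iterated = np.sqrt(1.0 - b) * iterated + np.sqrt(b) * rng.standard_normal(x0.shape)
```

Both arrays should have matching empirical mean and standard deviation up to sampling error, which is what lets the known region be noised to any timestep without running the chain.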

Method

Conditioning on the Known Region

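The paper formulates inpainting on a ground truth image $x$ by splitting it into a known region $m \odot x$ and an unknown region $(1-m) \odot x$, where $m$ is a binary mask that is 1 on known pixels and 0 on unknown pixels, matching the combination step used in this section.

Since each reverse step from $x_t$ to $x_{t-1}$ depends only on $x_t$, the authors argue that the known region $m \odot x_t$ can be replaced freely as long as it keeps the correct distribution for that timestep.

Therefore, each reverse step samples the unknown region from the trained DDPM and samples the known region directly from the closed-form forward marginal $q(x_t \mid x_0)$, as follows.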

\[x_{t-1}^{\text{known}} \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\, x_0, (1-\bar{\alpha}_t) \mathbf{I})\]

\[x_{t-1}^{\text{unknown}} \sim \mathcal{N}(\mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))\]

\[x_{t-1} = m \odot x_{t-1}^{\text{known}} + (1 - m) \odot x_{t-1}^{\text{unknown}}\]

Here, $x_{t-1}^{\text{known}}$ is the known region obtained by only adding noise to the ground truth image $x_0$, and $x_{t-1}^{\text{unknown}}$ is the result of the reverse step predicted by the model. The two are combined using the mask $m$ to form the new $x_{t-1}$.
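The combination step can be sketched in a few lines of NumPy. Following the combination equation above, the mask multiplies the known sample, so `mask` is assumed to be 1 on known pixels. The function name `combine` and the constant toy arrays are illustrative assumptions.

```python
import numpy as np

def combine(x_known, x_unknown, mask):
    """x_{t-1} = m * x_known + (1 - m) * x_unknown, with m = 1 on known pixels."""
    return mask * x_known + (1.0 - mask) * x_unknown

# Toy example: the left half of a 4x4 image is known, the right half is missing.
mask = np.zeros((4, 4))
mask[:, :2] = 1.0
x_known = np.full((4, 4), 1.0)      # stands in for the noised ground truth
x_unknown = np.full((4, 4), -1.0)   # stands in for the DDPM reverse-step output
x_prev = combine(x_known, x_unknown, mask)
```

After the call, the left half of `x_prev` carries the known values and the right half carries the model output.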

Figure 2. Overview of our approach. RePaint modifies the standard denoising process in order to condition on the given image content. In each step, we sample the known region (top) from the input and the inpainted part from the DDPM output (bottom).

Figure 2 illustrates the reverse step of RePaint. At every step, the known region is sampled from the input image, the unknown region is sampled from the DDPM output, and the two are combined with the mask.

Resampling

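The paper reports that applying the conditioning equations above directly yields an unknown region that matches the known region at the texture level but is semantically inconsistent with it.

The authors analyze the cause as follows. The known region is sampled without regard to the generated unknown region, so even though the model tries to harmonize the image at every step, the same disharmony is reintroduced at the next step. In addition, the variance schedule $\beta_t$ makes the maximum possible change per reverse step progressively smaller, so the later steps lack the flexibility to correct mismatches at the boundary.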
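To make the second point concrete, the sketch below prints how the per-step noise scale shrinks toward the end of the reverse process. It assumes a linear schedule from `1e-4` to `0.02` with `T = 250` and the fixed choice $\sigma_t^2 = \beta_t$ for the reverse-step variance. Both are illustrative assumptions, since the text above does not state the exact schedule or variance used.

```python
import numpy as np

T = 250
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule; betas[t-1] is beta_t
noise_std = np.sqrt(betas)           # assumed reverse-step noise scale sigma_t

early = noise_std[T - 1]   # t = T, first reverse step
late = noise_std[0]        # t = 1, last reverse step
print(f"sigma at t=T: {early:.4f}, sigma at t=1: {late:.4f}")
```

Under these assumptions the noise injected near $t=1$ is far smaller than near $t=T$, so a boundary mismatch that survives to the late steps has little room to be corrected.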

To address this, the paper proposes a resampling strategy: the $x_{t-1}$ obtained from a reverse step is diffused back to $x_t$ with a forward step.

\[x_t \sim \mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1}, \beta_t \mathbf{I})\]

This operation adds noise back to the output, but part of the information in $x_{t-1}^{\text{unknown}}$ is preserved in the new $x_t$, so the paper explains that the next reverse step yields a result that is better harmonized with the known region.
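One way to see that information is partially preserved is to measure the correlation between $x_{t-1}$ and the re-noised $x_t$. In the sketch below, a standard normal vector stands in for $x_{t-1}$ and `beta_t = 0.02` is an arbitrary illustrative value. Neither is taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_t = 0.02                              # illustrative value (assumption)
x_prev = rng.standard_normal(10000)        # stands in for x_{t-1}

# Resampling step: diffuse x_{t-1} back to x_t with one forward step.
x_t = np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * rng.standard_normal(x_prev.shape)

# Correlation between the sample before and after re-noising.
corr = np.corrcoef(x_prev, x_t)[0, 1]
print(f"correlation: {corr:.3f}")
```

For a small $\beta_t$ the correlation stays close to 1, which is consistent with the claim that the content generated so far carries over into the next reverse step.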

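The authors further note that resampling by a single step makes it hard to integrate semantic information across the entire denoising process, so they introduce a jump length $j$. The jump length is the number of timesteps the process goes back at each resampling, and $j=1$ corresponds to the single-step case above.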
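The following sketch lays out one possible order of timesteps when jumps are used. It is a hypothetical schedule written for illustration, not the authors' implementation: the function name `jump_schedule`, the choice of jump points at multiples of the jump length, and the parameter names are all assumptions.

```python
def jump_schedule(T, jump_length, n_resample):
    """List the timesteps visited when going from T down to 0 with jumps back.

    Hypothetical layout: at every positive multiple of jump_length that still
    leaves room to jump, the process goes back up by jump_length forward steps,
    and it does so n_resample - 1 times before moving on.
    """
    jumps_left = {t: n_resample - 1
                  for t in range(jump_length, T - jump_length + 1, jump_length)}
    t = T
    visited = [t]
    while t > 0:
        t -= 1                       # one reverse (denoising) step
        visited.append(t)
        if jumps_left.get(t, 0) > 0:
            jumps_left[t] -= 1
            for _ in range(jump_length):
                t += 1               # one forward (re-noising) step
                visited.append(t)
    return visited

schedule = jump_schedule(T=6, jump_length=2, n_resample=2)
print(schedule)
```

With `jump_length=1` and `n_resample=2` on `T=3`, the same function visits 3, 2, 3, 2, 1, 2, 1, 0, which is the single-step resampling pattern. Setting `n_resample=1` recovers plain descending sampling.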

Figure 3. The effect of applying $n$ sampling steps. The first example with $n = 1$ is the DDPM baseline, the second with $n = 2$ is with one resample step. More resampling steps lead to more harmonized images. The benefit saturates at about $n = 10$ resamplings.

Figure 3 shows how the inpainting result changes with the number of sampling steps $n$. $n=1$ is the standard DDPM baseline, and the paper reports that the image becomes more harmonized as $n$ increases. The benefit saturates at around $n=10$.

Algorithm

The paper summarizes the proposed RePaint inpainting procedure as Algorithm 1.

Each line of the algorithm can be read as follows.

Line 1 starts the reverse process from Gaussian noise $x_T$, as in standard DDPM.
Lines 2–13 form the outer loop, which runs the reverse process from timestep $T$ down to $1$.
Lines 3–12 define the inner loop, which performs $U$ resampling iterations at each timestep. Here $U$ plays the role of the resampling count $r$ used in the experiments.
Lines 4–5 sample the known region $x_{t-1}^{\text{known}}$ directly from the ground truth image $x_0$ using the forward process.
Lines 6–7 sample the unknown region $x_{t-1}^{\text{unknown}}$ with a reverse step of the trained DDPM.
Line 8 is the conditioning step, which combines the known and unknown regions using the mask $m$ to form the new $x_{t-1}$.
Lines 9–11 perform the resampling: unless the current iteration is the last resampling iteration or the timestep is $t=1$, $x_{t-1}$ is diffused forward again to $x_t$.
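The line-by-line reading above can be put together as the following NumPy sketch with jump length 1. It is not the authors' implementation. The noise predictor `eps_model` is a hypothetical stand-in for the trained network, the reverse-step mean uses the standard DDPM noise-prediction parameterization with fixed variance $\sigma_t^2 = \beta_t$, and the noise is set to zero at $t=1$ as in standard DDPM sampling. All three are assumptions. The known region uses $\bar{\alpha}_t$ as written in the equations above.

```python
import numpy as np

def repaint_sample(x0, mask, eps_model, betas, U, rng):
    """Sketch of the RePaint loop (jump length 1); mask is 1 on known pixels."""
    T = len(betas)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x_t = rng.standard_normal(x0.shape)                    # line 1: start from noise
    for t in range(T, 0, -1):                              # lines 2-13: outer loop
        b, a, abar = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
        for u in range(1, U + 1):                          # lines 3-12: inner loop
            # Lines 4-5: noise the ground truth to get the known region.
            eps = rng.standard_normal(x0.shape) if t > 1 else np.zeros_like(x0)
            x_known = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
            # Lines 6-7: reverse step for the unknown region (assumed parameterization).
            z = rng.standard_normal(x0.shape) if t > 1 else np.zeros_like(x0)
            mean = (x_t - b / np.sqrt(1.0 - abar) * eps_model(x_t, t)) / np.sqrt(a)
            x_unknown = mean + np.sqrt(b) * z
            # Line 8: combine both regions with the mask.
            x_prev = mask * x_known + (1.0 - mask) * x_unknown
            # Lines 9-11: diffuse back to x_t unless this is the last iteration or t = 1.
            if u < U and t > 1:
                x_t = np.sqrt(1.0 - b) * x_prev + np.sqrt(b) * rng.standard_normal(x0.shape)
        x_t = x_prev                                       # continue from x_{t-1}
    return x_t

def dummy_eps(x_t, t):
    """Hypothetical noise predictor that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = np.full((8, 8), 0.5)                  # toy ground truth (assumption)
mask = np.zeros((8, 8))
mask[:, :4] = 1.0                          # left half known
betas = np.linspace(1e-4, 0.02, 50)        # assumed linear schedule
out = repaint_sample(x0, mask, dummy_eps, betas, U=3, rng=rng)
```

With the zero predictor the unknown half is meaningless, but the known half of `out` ends up close to `x0`, which confirms that the conditioning path behaves as described. A real use would replace `dummy_eps` with a trained DDPM.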

Experiment

Implementation Details

The paper reports experiments on the CelebA-HQ and ImageNet datasets. For ImageNet, the pretrained guided diffusion model is used as is, while the CelebA-HQ model is trained at 256×256 resolution on 4×V100 GPUs for 250,000 iterations.

The final setting uses $T=250$ timesteps, $r=10$ resamplings, and jump length $j=10$.

Metrics

Evaluation uses both a user study and the LPIPS metric. In the user study, five users per image choose whether the RePaint result or the baseline result looks more realistic, and the paper reports that only responses with a self-consistency of at least 75% are counted as valid.

Comparison with State-of-the-Art

The paper compares RePaint against AOT, DSI, ICT, DeepFillv2, and LaMa.

Figure 4. CelebA-HQ Qualitative Results. Comparison against the state-of-the-art methods for Face Inpainting over several mask settings. Zoom-in for better details.

Figure 4 shows the qualitative comparison on CelebA-HQ. The authors claim that ICT lacks global consistency and LaMa produces checkerboard artifacts, whereas RePaint generates the most natural results across the different mask settings.

Figure 5. ImageNet Qualitative Results. Comparison against the state-of-the-art methods for pluralistic inpainting methods over different mask settings. Zoom-in for better details.

Figure 5 shows the qualitative results on ImageNet. The paper reports that RePaint generates sharp and semantically meaningful images for both thin and thick masks.

Table 1. CelebA-HQ (top) and ImageNet (bottom) Quantitative Results. Comparison against the state-of-the-art methods. We compute the LPIPS (lower is better) and Votes for six different mask settings. Votes refers to the ratio of votes with respect to ours.

Table 1 shows the quantitative comparison of LPIPS and user votes on CelebA-HQ and ImageNet. The paper reports that RePaint outperforms the baselines in at least 5 of the 6 mask settings.

For the thin masks, “Super-Resolution 2×” and “Alternating Lines”, the paper reports that RePaint receives 73.1%–99.3% of the user votes.

For the thick masks, “Expand” and “Half”, LaMa is in some cases slightly ahead of RePaint in LPIPS. The authors argue, however, that RePaint generates semantically diverse images and therefore tends to produce plausible images that differ from the ground truth, in which case LPIPS may not be an appropriate metric.

Diversity Analysis

The paper explains that each reverse diffusion step samples new noise from a Gaussian distribution and is therefore inherently stochastic. Because no loss is imposed directly on the inpainted region, the model can generate diverse images that fit the training set distribution.

Class-Conditional Experiment

Figure 6. Visual results for class guided generation on ImageNet.

Figure 6 shows inpainting results that use the class-conditional capability of the pretrained ImageNet DDPM. With the “Expand” mask, the same input is conditioned on different classes such as “Granny Smith”, “Head Cabbage”, “Broccoli”, and “Cauliflower”.

Ablation Study

To show that the gain from resampling does not come simply from a larger computational budget, the paper compares it against simply slowing down the diffusion process.

Table 2. Analysis of the use of computational budget. We compare slowing down the diffusion process and resampling. We use the ImageNet validation set with 32 images over the LaMa Wide mask setting. The number of diffusion steps is denoted by $T$, and the number of resamplings by $r$.

According to Table 2, at the same computational budget, slowing down brings no visible improvement, whereas resampling consistently lowers LPIPS.

Figure 7. Qualitative Analysis of the use of computational budget. RePaint produces higher visual quality with the same computational budget by resampling (bottom) compared to slowing down the diffusion process (top). The number of diffusion steps is denoted by $T$ and resamplings by $r$.

Figure 7 shows the qualitative comparison between slowing down and resampling. The paper explains that, with the same budget, resampling produces more harmonized results.

The paper additionally reports ablations over the jump length $j$ and the number of resamplings $r$.

Table 3. Ablation Study. Analysis of length of the jumps $j$ and number of resamplings $r$. We report LPIPS and the average user-study votes with respect to LaMa. We use 32 images from the CelebA validation set, and use the LaMa Wide mask setting.

In Table 3, the authors report better performance for larger jump lengths and observe that with $j=1$ the DDPM tends to generate blurry images.

Finally, the paper compares RePaint's resampling against the resampling schedule of SDEdit.

Table 4. Comparison with the resampling schedule proposed in SDEdit in terms of LPIPS. The resampling method proposed in our RePaint (Sec. 4.2) achieves substantially better results, in particular for the Super-Resolution masks.

Table 4 shows that RePaint's resampling strategy achieves better LPIPS than the SDEdit schedule on most mask types. The paper highlights that LPIPS drops by more than 53% on the super-resolution masks.

Limitation

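The paper mentions the following two limitations of RePaint.

The authors acknowledge that the DDPM-based per-image optimization process is significantly slower at inference than GAN-based and autoregressive methods, which makes it difficult to apply to real-time applications.
In extreme mask cases, RePaint can generate plausible images that are very different from the ground truth. This is an advantage in terms of diversity, but it makes evaluation with reference-based metrics such as LPIPS difficult. The paper suggests the FID score as an alternative, but notes that a reliable FID requires more than 1,000 images and that the runtime for this is very burdensome with current DDPMs.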

Conclusion

In summary, RePaint is a method that performs free-form image inpainting with only a pretrained unconditional DDPM, without any mask-specific training.

The authors claim that the proposed resampling strategy effectively harmonizes the known and unknown regions during the reverse diffusion process, and they report state-of-the-art performance across a variety of mask types.

Inference speed and the limits of reference-based metrics, however, remain open problems.

Reference

Lugmayr, Andreas, et al. “RePaint: Inpainting using denoising diffusion probabilistic models.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.