Multi-Objective GFlowNets

Introduction

논문에서는 drug discovery, material design과 같이 실제 응용에서 다수의 conflicting objective를 동시에 만족하는 candidate를 생성해야 하는 문제를 다룬다.

예를 들어 in-silico drug discovery에서는 target에 효과적으로 binding하면서도 synthesizability가 좋고, 인체에 안전한 molecule을 동시에 만족해야 한다고 설명하고 있다. 이러한 objective들은 서로 mutually incompatible할 수 있기 때문에, 모든 objective를 동시에 최대화하는 single solution은 존재하지 않는다.

Multi-Objective Optimization (MOO)에서는 이 문제를 Pareto optimality 개념으로 다루며, objective 간의 best trade-off를 제공하는 candidate들의 집합을 찾는 것을 목표로 한다고 이야기한다.

논문에서는 기존 MOO 접근법들이 간과한 중요한 측면을 강조한다. 첫째, optimization 대상이 되는 objective들은 실제로는 imperfect proxy인 경우가 많다. 예를 들어 target에 대한 binding affinity는 인체 내 inhibitory effect의 imperfect approximation에 불과하다. 둘째, 이러한 imperfect proxy를 다룰 때는 단순히 Pareto front를 cover하는 것을 넘어, 각 Pareto optimal solution에 대해 diverse한 candidate set을 생성해야 expensive downstream evaluation (in-vivo test, clinical trial)에서 성공할 가능성이 높아진다고 주장한다.

저자들은 GFlowNet이 reward maximization 관점이 아니라 reward에 proportional하게 candidate를 sampling하는 특성을 가지므로, MOO의 diverse candidate generation에 적합하다고 본다. 이러한 관찰을 바탕으로 Multi-Objective GFlowNet (MOGFN)을 제안한다.

Contribution

논문에서 제시한 주요 contribution은 다음과 같다.

C1: MOO에서 diverse candidate generation이라는 이전에 다루어지지 않은 문제를 해결하기 위한 MOGFN framework를 제안한다.
C2: Molecule generation, sequence generation task에서 MOGFN-PC가 diverse Pareto-optimal candidate를 생성함을 보인다. 이는 reward-conditional GFlowNet의 첫 successful empirical validation이라고 주장한다.
C3: Fluorescent protein design task에서 MOGFN-AL이 sample efficiency와 diversity를 모두 개선한다고 보고한다.
C4: MOGFN의 핵심 component에 대한 분석을 통해 design choice가 성능에 미치는 영향을 분석한다.

Background

Multi-Objective Optimization

MOO는 $d$개의 objective $R(x) = [R_1(x), \ldots, R_d(x)]$를 동시에 maximize하는 feasible candidate $x^\star \in \mathcal{X}$를 찾는 문제로 정의된다.

\[\max_{x \in \mathcal{X}} R(x)\]

위 수식에서 $R_i(x)$는 $i$번째 objective function을 의미한다. Objective들이 conflicting할 경우, 모든 objective를 동시에 maximize하는 single $x^\star$는 존재하지 않으므로 Pareto optimality 개념을 도입한다.

$x_1, x_2 \in \mathcal{X}$에 대해, $x_1$이 $x_2$를 dominate한다 ($x_1 \succ x_2$)는 것은 모든 $i$에 대해 $R_i(x_1) \geq R_i(x_2)$이고, 적어도 하나의 $k$에 대해 $R_k(x_1) > R_k(x_2)$인 경우를 의미한다. Pareto-optimal candidate는 자신을 dominate하는 다른 solution이 존재하지 않는 candidate이다.

논문에서는 objective가 injective하지 않을 수 있으므로, Pareto front의 한 점에 대응하는 candidate가 여러 개 존재할 수 있다고 설명한다. 이는 candidate space에서의 diversity를 의미하며, in-silico drug discovery 같은 응용에서 매우 중요한 특성이라고 주장한다.

Scalarization

Scalarization은 objective $R_i$에 weight $\omega_i$ ($\omega_i \geq 0$, $\sum_i \omega_i = 1$)를 할당하여 MOO 문제를 single-objective sub-problem으로 분해하는 방법이다.

대표적으로 Weighted Sum scalarization은 다음과 같이 정의한다.

\[R(x|\omega) = \sum_{i=1}^{d} \omega_i R_i(x)\]

Weighted Tchebycheff scalarization은 non-convex Pareto front에서도 모든 Pareto-optimal solution을 capture할 수 있다고 설명한다.

\[R(x|\omega) = \min_{1 \leq i \leq d} \omega_i |R_i(x) - z_i^\star|\]

위 수식에서 $z_i^\star$는 objective $R_i$에 대한 ideal value를 의미한다.

GFlowNets

GFlowNet은 compositional object $x \in \mathcal{X}$를 sequential step을 통해 reward $R(x)$에 proportional한 probability로 generation하는 stochastic policy를 학습하는 probabilistic model이다.

학습 objective로 trajectory balance (TB)를 사용하며, forward policy $P_F(- \mid s; \theta)$, backward policy $P_B(- \mid s; \theta)$, partition function $Z_\theta$를 parameterize한다. 학습된 policy는 $\pi(x) \approx R(x)/Z$를 만족하도록 최적화된다.

Method

MOGFN-PC: Preference-Conditional GFlowNets

논문에서는 각 sub-problem을 별도 GFlowNet으로 푸는 naive 방식이 computational cost와 sub-problem 간 shared structure를 활용하지 못하는 한계가 있다고 지적한다.

이를 해결하기 위해 reward-conditional GFlowNet을 활용하여 preference $\omega \in \Delta^d$를 conditioning variable로 받는 single conditional policy를 학습한다. 이렇게 하면 모든 sub-problem을 단일 model로 simultaneously 처리할 수 있고, training 시 보지 못한 $\omega$에 대해서도 generalize할 수 있다고 주장한다.

MOGFN-PC의 학습 objective는 trajectory balance를 conditional하게 확장한 형태이다.

\[\mathcal{L}(\tau, \omega; \theta) = \left( \log \frac{Z_\theta(\omega) \prod_{s \to s' \in \tau} P_F(s' \mid s, \omega; \theta)}{R(x \mid \omega) \prod_{s \to s' \in \tau} P_B(s \mid s', \omega; \theta)} \right)^2\]

위 수식에서 $\tau$는 generated trajectory를 의미하고, $\omega$는 preference vector를 나타낸다. $Z_\theta(\omega)$는 preference에 대한 conditional partition function이며, $P_F$와 $P_B$는 각각 forward, backward conditional policy를 의미한다.

논문에서는 새로운 scalarization으로 Weighted-log-sum (WL)을 제안한다.

\[R(x \mid \omega) = \prod_{i=1}^{d} R_i(x)^{\omega_i}\]

WL은 log space에서의 weighted sum으로 해석될 수 있으며, 모든 objective를 동시에 최적화해야 하는 경우 single reward에 의해 dominate되는 Weighted Sum의 한계를 보완할 수 있다고 가정한다.

또한 reward exponent $\beta$를 적용하여 $\pi(x \mid \omega) \propto R(x \mid \omega)^\beta$로 학습한다. $\beta$를 통해 diversity와 reward 간의 trade-off를 조절할 수 있다고 설명한다.

Algorithm 1. Training preference-conditional GFlowNets

Algorithm 1의 각 단계는 다음과 같이 해석할 수 있다.

Input/Initialize 부분에서는 preference distribution $p(\omega)$, reward exponent $\beta$, mixing coefficient $\delta$를 정의하고, conditional GFlowNet $(P_F, P_B, \log Z)$를 parameter $\theta$로 초기화한다.
매 training step마다 preference $\omega$를 $p(\omega)$로부터 sampling한다고 설명하고 있다.
$(1-\delta) P_F + \delta \text{Uniform}$ 형태의 exploration policy $\hat{\pi}$로 trajectory $\tau$를 sampling하는 과정이라고 볼 수 있다.
생성된 sample에 대해 reward $R(x \mid \omega)^\beta$와 trajectory balance loss $\mathcal{L}(\tau, \omega; \theta)$를 계산한다.
Loss gradient $\nabla_\theta \mathcal{L}(\tau, \omega)$로 parameter $\theta$를 업데이트한다고 이야기한다.

논문에서는 MOReinforce (Lin et al., 2021)와의 차이점을 강조한다. MOReinforce는 preference $\omega$가 주어졌을 때 $R(x \mid \omega)$를 maximize하는 single candidate로 수렴하지만, MOGFN-PC는 $R(x \mid \omega)$에 proportional하게 sampling하므로 diverse Pareto-optimal candidate를 생성할 수 있다고 주장한다.

MOGFN-AL: Multi-Objective Active Learning with GFlowNets

논문에서는 objective evaluation이 expensive한 경우 sample efficiency가 critical하다고 본다. 예를 들어 binding energy 평가는 imperfect simulation에서도 수 시간이 걸린다고 설명한다.

MOGFN-AL은 multi-objective Bayesian optimization loop의 각 round에서 GFlowNet으로 candidate를 생성하는 방법이며, GFlowNet-AL (Jain et al., 2022)의 multi-objective extension이다.

기존 GFlowNet 기반 sequence design에서는 token-by-token으로 sequence를 생성하지만, long sequence에서는 prohibitively slow하다는 한계가 있다고 지적한다. 이를 해결하기 위해, 각 round $i$에서 non-dominated candidate $\hat{\mathcal{P}}_i$의 sequence $x$에 대해 mutation $m = \{(l_i, v_i)\}_{i=1}^{T}$를 생성하는 방식을 제안한다.

여기서 $l \in \{1, \ldots, \lvert x \rvert\}$는 mutation 위치이고, $v \in \mathcal{A}$는 새 token, $T$는 mutation 수이다. Mutation의 reward는 acquisition function 값 $R(m, x) = a(x'^m \mid \hat{f})$로 정의한다.

Algorithm 2의 각 단계는 다음과 같이 해석할 수 있다.

Input/Initialize 부분에서는 oracle $\mathcal{R}$, initial dataset $\mathcal{D}_0$, surrogate model $\hat{f}$, acquisition function $a$, GFlowNet policy $\pi_\theta$를 정의한다.
매 round $i$마다 이전까지 누적된 dataset $\mathcal{D}_{i-1}$에 surrogate model $\hat{f}$를 fitting한다고 설명하고 있다.
$\mathcal{D}_{i-1}$에서 non-dominated candidate $\hat{\mathcal{P}}_{i-1}$를 추출한다.
Acquisition function $a(- \mid \hat{f})$를 reward로 사용하여 GFlowNet policy $\pi_\theta$를 학습하는 과정이라고 볼 수 있다. 이때 policy는 $\hat{\mathcal{P}}_{i-1}$의 sequence에 대한 mutation을 생성하도록 학습된다.
$\hat{\mathcal{P}}_{i-1}$에서 sequence $x'_i$를 sampling하고 mutation $m_i$를 적용하여 batch $\mathcal{B}$를 구성한다고 이야기한다.
Batch를 oracle $\mathcal{R}$로 평가하여 새 dataset $\hat{\mathcal{D}}_i$를 생성하고 dataset $\mathcal{D}_i = \hat{\mathcal{D}}_i \cup \mathcal{D}_{i-1}$로 업데이트한다.
최종적으로 approximate Pareto set $\hat{\mathcal{P}}_N$을 반환한다.

Experiment & Result

HyperGrid Task

논문에서는 MOGFN-PC가 preference-conditional reward distribution을 잘 capture하는지 확인하기 위해 multi-objective HyperGrid task를 사용한다. State space가 작아 closed form으로 학습된 distribution을 계산할 수 있다고 설명한다.

Figure 1. (a) The distribution learned by MOGFN-PC (Top) almost exactly matches the ground truth distribution (Bottom), in particular capturing all the modes, on hypergrid of size 32 × 32 with 3 objectives. (b) and (c) Pareto front of candidates generated by MOGFN-PC on the N-grams task illustrates that the MOGFN-PC can model conflicting and correlated objectives well.

Figure 1(a)에서 MOGFN-PC가 학습한 $\pi(x \mid \omega)$와 ground truth $R(x \mid \omega)$가 거의 일치하며, 모든 mode를 capture한다고 보고한다. 64개의 preference에 대해 평균 $\mathbb{E}_x[\lvert\pi(x \mid \omega) - R(x \mid \omega)/Z(\omega)\rvert] \approx 10^{-4}$의 차이를 보였다고 설명한다.

N-Grams Task

Figure 1(b)와 1(c)에서는 N-grams task의 결과를 보여준다. 3 Unigrams (conflicting objective)와 3 Bigrams (correlated objective) 모두에서 MOGFN-PC가 trade-off를 잘 modeling한다고 주장한다.

QM9 Task

QM9 dataset 기반 small-molecule generation task에서 4개의 reward signal (HOMO-LUMO gap, SA, molecular weight target, logP target)을 사용한다. Table 1에서 MOGFN-PC가 Reward, Diversity, HV, R2 모든 metric에서 baseline을 outperform한다고 보고한다.

Fragment-based Molecule Generation

Fragment-based molecule generation에서는 sEH proxy, SA, QED, molecular weight target을 사용한다. Table 2에서 MOGFN-PC가 HV, R2뿐만 아니라 candidate diversity score에서도 baseline을 consistently outperform한다고 보고한다.

DNA Sequence Design

DNA aptamer 생성 task에서는 free energy, base pair 개수, sequence length의 inverse를 objective로 사용한다. Table 3에서 MOReinforce가 MOO performance는 가장 우수하나, GCGCGC… 패턴의 quasi-trivial solution으로 수렴하여 diversity가 낮다고 설명한다. 반면 MOGFN-PC는 diversity와 Top-K reward에서 우수하지만 Pareto performance가 약간 낮다고 보고한다.

Active Learning: Proxy RFP

Figure 2. MOGFN-AL demonstrates a substantial advantage in terms of (a) Relative Hypervolume, and (b) the Pareto frontier of candidates generated by MOGFN-AL dominates the Pareto front of the initial dataset, while being more diverse (c) than the baselines’.

Proxy RFP task는 folding stability와 solvent-accessible surface area (SASA)를 최적화하여 novel red fluorescent protein을 발굴하는 task이다.

Figure 2(a)에서 MOGFN-AL이 동일 black-box evaluation budget에서 relative hypervolume 측면에서 LaMBO를 약 절반의 budget으로 따라잡는다고 보고한다. Figure 2(b)는 MOGFN-AL이 생성한 Pareto frontier가 initial dataset의 Pareto front를 dominate함을 보여준다. Figure 2(c)에서 DIAMOND 기반 e-value로 측정한 diversity가 baseline들보다 높다고 설명한다.

Analysis

Figure 3. Analysing the effect of p(ω), R(x|ω) and β in the 3 Bigrams (left) and fragment-based molecule generation (right) tasks. The metrics are normalized by the maximum value obtained. As key takeaways, Dirichlet(α = 1) for p(ω) and weighted sum (WS) scalarization have the best performance in both tasks while the choice of β is task-dependent.

논문에서는 MOGFN-PC의 핵심 component인 $p(\omega)$, $\beta$, scalarization function $R(x \mid \omega)$에 대한 분석을 수행한다.

$p(\omega)$를 $\text{Dirichlet}(\alpha)$로 두고 $\alpha \in {0.1, 1, 10}$을 비교한 결과, $\alpha = 1$ (uniform distribution)이 가장 좋은 성능을 보였다고 보고한다. 다만 $\alpha = 0.1$, $\alpha = 10$의 skewed distribution에서도 비슷한 성능을 보였으며, MOGFN-PC가 training 중 보지 못한 preference에도 interpolate할 수 있다고 주장한다.

Scalarization function 측면에서는 Weighted Sum (WS)이 Weighted-log-sum (WL)과 Weighted Tchebycheff (WT)보다 우수하다고 보고한다. WT와 WL의 less smooth reward landscape가 성능 저하의 원인이라고 추측한다.

$\beta$의 경우, 값이 클수록 reward density가 mode 근처에 집중되어 Pareto performance가 향상되지만 diversity는 감소한다고 설명한다. 또한 너무 큰 $\beta$는 optimization 자체를 어렵게 만들어 성능을 저하시킨다고 본다. 따라서 $\beta$는 task-specific parameter로서 trade-off를 조정하는 데 사용된다고 이야기한다.

Limitation

논문에서는 다음과 같은 한계를 인정한다.

Reward가 single mode를 가지는 문제에서는 MOGFN의 이점이 제한적이라고 설명한다. DNA aptamer 생성 task에서 이러한 양상이 관찰되었다.
실제 응용에서는 Pareto front 전체가 아닌 특정 region만 필요한 경우가 있으며, 이를 위한 structured exploration 방법이 future work로 남아 있다고 이야기한다.
MOGFN-AL의 경우, preference-conditional acquisition function 개발이 향후 연구 방향으로 제시된다.
MOGFN-PC가 conditional distribution을 modeling하므로 single-objective GFlowNet 대비 model capacity 요구가 더 크며, 더 나은 training objective가 필요할 수 있다고 본다.

Reference

Jain, Moksh, et al. “Multi-objective gflownets.” International conference on machine learning. PMLR, 2023.