ChemMORT: an automatic ADMET optimization platform using deep learning and multi-objective particle swarm optimization

Introduction

신약 개발은 약 15년의 시간과 $12–15 million에 이르는 막대한 비용이 소요되는 과정이다. 논문에서는 combinatorial chemistry, X-ray crystallography, high-throughput screening (HTS), virtual screening (VS) 등 다양한 혁신 기술이 도입되었음에도, 임상 후보물질 개발 과정의 attrition rate가 여전히 약 75%에 이른다고 설명하고 있다. 저자들은 이러한 실패 원인의 약 50%가 absorption, distribution, metabolism, elimination, toxicity (ADMET) property 부족에서 비롯된다고 주장한다.

ADMET optimization은 여러 property를 동시에 개선해야 하면서도 biological potency를 유지해야 하는 multi-parameter optimization 문제이다. 저자들은 vast하고 discrete한 drug-like chemical space와 제한된 expert knowledge로 인해 이 task가 매우 도전적이라고 이야기한다.

논문에서는 deep learning을 활용한 molecular representation 학습이 이러한 문제를 효율적으로 해결할 수 있는 방향으로 제시되고 있다. Deep neural network가 학습한 chemical space는 smooth, continuous, unique, expressive하다는 특성을 가지며, 이는 molecular optimization에 효과적인 환경을 제공한다고 설명하고 있다. 특히 Gomez-Bombarelli et al.이 제안한 variational autoencoder 기반 접근은 reversibility라는 추가적인 특성을 제공하여 inverse QSAR 문제 해결에 강점이 있다고 이야기한다.

기존 연구에서는 단순 reconstruction task로는 syntactic concept이나 repetitive pattern만 학습할 수 있다는 한계가 있다고 보고된다. 이를 해결하기 위해, 동일한 분자에 대한 서로 다른 SMILES representation 간 translation task로 전환하여 higher-level molecular feature를 학습하는 전략이 제안되었다. 또한 enumerated SMILES로 학습된 model이 data augmentation과 중요한 molecular feature 학습에 더 유리하다고 알려져 있다.

Contribution

저자들은 ADMET endpoints의 multi-objective optimization을 자동화하는 freely available platform인 ChemMORT (Chemical Molecular Optimization, Representation and Translation)를 제안한다. 논문에서 강조하는 주요 contribution은 다음과 같다.

17 million의 enumerated SMILES로 학습된 SMILES Encoder를 통해 512-dimensional latent representation을 생성하고, Descriptor Decoder가 이를 다시 canonical SMILES로 복원하는 reversible molecular representation을 구축하였다.
Reversible latent representation과 particle swarm optimization (PSO) strategy를 결합하여, target potency 손실 없이 다수의 ADMET property를 최적화하는 inverse QSAR design을 가능하게 하였다.
Similarity constraint와 substructure constraint를 도입하여 bioactivity motif와 reference molecule과의 구조적 유사성을 유지함으로써, optimization 과정에서 target binding affinity가 보존되도록 설계하였다.
Case study로 poly (ADP-ribose) polymerase-1 (PARP-1) inhibitor의 constrained multi-objective optimization을 수행하여, 실제 lead optimization 시나리오에서 platform의 utility를 검증하였다고 보고한다.

Method

Neural Translation Model

논문에서는 human language neural network translation model에서 영감을 받아, SMILES notation 기반의 sequence-to-sequence (seq2seq) model을 사용한다고 설명하고 있다. Encoder는 enumerated SMILES를 고정 길이 vector로 변환하고, decoder는 이 vector를 canonical SMILES로 복원하는 구조이다. 이 고정 길이 vector는 structure modification과 property optimization을 연결하는 매개체로 활용된다.

저자들은 RNN backend가 가지는 vanishing/exploding gradient 문제를 완화하기 위해 encoder와 decoder 모두 3개의 stacked Gate Recurrent Unit (GRU) layer를 사용한다고 이야기한다. Encoder의 마지막에는 information bottleneck 역할을 하는 fully connected layer (512 units, hyperbolic tangent activation function)가 배치되어, 가장 statistically salient한 molecular feature를 capture하는 512-dimensional latent representation을 생성한다.

Decoder는 latent representation을 input으로 받아 1024, 512, 256 unit으로 구성된 3개의 stacked GRU layer를 통과시킨다. 각 time step의 input은 직전 time step의 output과 ground truth의 embedding이 함께 사용된다.

Training phase에서는 decoder output과 ground truth 간 cross-entropy loss를 계산하여 gradient update를 수행한다. Prediction phase에서는 beam search algorithm을 적용하여, 가장 유망한 node를 확장하면서 character 단위로 sequence를 생성한다고 설명하고 있다.

Dataset은 1.7 million의 accessible molecule로 구성되며, 9:1 비율로 training set (1.53 million)과 test set (0.17 million)으로 분할된다. 각 molecule은 10개의 서로 다른 enumerated SMILES로 변환되어 encoding에 사용된다. 논문에서는 encoder를 enumerated SMILES로, decoder를 canonical SMILES로 학습할 때 translation correction rate와 chemical space breadth 사이의 균형이 가장 좋다고 보고한다.

ADMET Prediction Model

논문에서는 ChEMBL, EPA, DrugBank database에서 수집된 약 30,000개의 entry를 기반으로 ADMET dataset을 구축하였다고 설명하고 있다. 데이터 전처리는 Molecular Operating Environment (MOE, version 2016)을 사용한다.

수집된 ADMET endpoint는 logD7.4, LogS, Caco-2, MDCK cells, plasma protein binding rate (PPB), AMES toxicity, hERG toxicity, hepatotoxicity, median lethal dose (LD50)의 9가지이다. 저자들은 encoder가 생성한 512-dimensional descriptor와 XGBoost algorithm을 결합하여 9개의 ADMET prediction model을 구축한다고 이야기한다. 추가로 SlogP, quantitative estimate of drug-likeness (QED) score, synthetic accessibility (SA) score 등 calculated property도 ChemMORT에 포함된다.

Scoring Scheme

논문에서는 multi-parameter optimization 결과를 정량적으로 평가하기 위해 weighted scoring scheme을 제안한다. 개별 property의 desirability는 recommended value range를 기준으로 다음과 같이 정의된다.

Value가 optimal range 내에 있는 경우 1
Value가 optimal range를 벗어났지만 recommended value range 내에 있는 경우 $(0, 1)$ 사이 값
Value가 recommended value range를 벗어난 경우 0

각 property에는 task별 importance에 따라 customized weight가 부여되며, final score는 다음과 같이 weighted average로 계산된다.

\[F = \frac{\sum_{i=1}^{j} S_i \cdot W_i}{\sum_{i=1}^{j} W_i}\]

위 수식에서 $j$는 optimization에 사용된 objective function의 개수를 의미하고, $S_i$는 $i$번째 objective function의 desirability score를 나타낸다. $W_i$는 해당 objective의 priority에 해당하는 weight를 의미한다. Final score 값이 낮을수록 undesirable optimization, 높을수록 acceptable optimization을 의미한다.

Particle Swarm Optimization

논문에서는 continuous representation과 scoring scheme을 결합한 search 전략으로 particle swarm optimization (PSO)을 채택한다고 설명하고 있다. PSO는 새 무리의 flocking이나 물고기 무리의 schooling behavior에서 영감을 받은 stochastic optimization method로, swarm 내 개체들이 위치 $x$와 속도 $v$를 가지고 search space를 탐색하면서 정보를 교환한다.

각 iteration $k$에서 $i$번째 particle은 자신이 방문한 historical best point에 의해 영향을 받는다.

\[x_i^{best} = \arg\max f(x_i^k)\]

또한 swarm 전체의 historical best point에도 영향을 받는다.

\[x^{best} = \arg\max f(x_i^{best})\]

각 iteration이 끝나면 particle은 자신의 상태와 수집한 정보를 바탕으로 velocity와 position을 업데이트한다.

\[v_i^{k+1} = w v_i^k + c_1 r_1 (x_i^{best} - x_i^k) + c_2 r_2 (x^{best} - x_i^k)\] \[x_i^{k+1} = x_i^k + v_i^{k+1}\]

위 수식에서 $c_1$과 $c_2$는 individual experience와 swarm experience의 기여도를 조절하는 상수를 의미한다. $r_1$과 $r_2$는 $[0, 1]$ uniform distribution에서 추출된 random number를 나타낸다. $w$는 inertia weight로, 직전 iteration의 momentum을 얼마나 유지할지를 결정한다. ChemMORT에서는 encoder output을 particle의 initial position으로 활용한다.

ChemMORT Workflow

논문에서는 ChemMORT가 SMILES Encoder, Descriptor Decoder, Molecular Optimizer의 3개 모듈로 구성된다고 설명하고 있다. 각 모듈은 descriptor 계산, molecular translation, ADMET optimization 기능을 담당한다.

SMILES Encoder는 SMILES 문자열 입력, editor에서의 분자 그리기, file upload (.sdf/.csv/.txt)를 모두 지원한다. 입력된 분자는 학습된 encoder를 통해 512-dimensional vector로 변환되고, 결과는 SMILES, molecular graph, vector 정보와 함께 csv 파일로 저장 가능하다.
Descriptor Decoder는 $[-1, 1]$ 범위의 512-dimensional vector를 입력으로 받아 decoder network를 통해 canonical SMILES로 back-engineering한다. 논문에서는 SMILES의 character-by-character 특성과 internal syntax의 fragility로 인해, 임의의 vector 조합은 invalid molecule을 출력할 수 있다고 이야기한다. Encoder와 Decoder의 결합으로 inverse design 문제 해결이 가능하다는 점이 강조된다.
Molecular Optimizer는 reversible molecular representation, credible QSAR model, structural constraint, multi-objective PSO strategy를 통합한 inverse QSAR methodology를 따른다. 사용자는 12개의 objective function (basic molecular property, SA, drug-likeness, absorption, distribution, toxicity 등) 중 선택할 수 있으며, ECFP4 fingerprint와 Tanimoto similarity 기반의 similarity constraint, 그리고 bioactivity motif 기반의 substructure constraint를 함께 설정할 수 있다.

Experiment & Result

Neural Translation Model Training

Figure 2. The translation accuracy for the training set and test set during the 160 k training steps.

논문에서는 multi-layer GRU network에 input dropout, bottleneck layer, Gaussian noise term을 적용하여 model을 학습하였다고 설명하고 있다. Batch size 64, dropout ratio 0.15, embedding noise 0.05의 설정이 사용된다. Figure 2에서는 training set과 test set의 translation accuracy가 초기에 빠르게 상승한 후 안정되는 양상을 보인다.

저자들은 training set과 test set 모두에서 최종 average single character accuracy가 99.8%에 도달하였다고 보고한다. 이는 전체 예측 길이에 대해 올바르게 예측된 character의 비율을 의미한다. 논문에서는 이러한 결과가 seq2seq model의 reliability를 입증하며, latent space가 ADMET prediction과 optimization에 활용 가능한 molecular descriptor로 기능할 수 있음을 시사한다고 해석한다.

ADMET Predictive Model Validation

Table 1: The performance of ADMET prediction models in ChemMORT.

ADMET prediction model의 평가는 ChemSAR의 ‘Diverse training set split’ module을 활용하여 chemical space distribution 기반으로 75:25의 비율로 분할한 training/test set에서 수행된다. Regression model에는 RMSE, MAE, $R^2$가, classification model에는 accuracy, sensitivity, AUC가 evaluation metric으로 사용된다.

논문에서는 regression model 평균 성능이 5-fold cross validation에서 RMSE 0.442, $R^2$ 0.747, test set에서 RMSE 0.437, $R^2$ 0.752를 기록하였다고 보고한다. Classification model의 경우 5-fold cross validation에서 accuracy 0.763, AUC 0.836, test set에서 accuracy 0.780, AUC 0.846의 성능을 보였다고 이야기한다.

저자들은 이러한 결과가 ADMET prediction model의 credibility뿐만 아니라 ChemMORT가 생성한 latent representation의 effectiveness와 utility를 입증한다고 해석한다. 특히 encoding-decoding network의 reversibility 덕분에 ADMET prediction model이 molecular optimization을 직접 navigate할 수 있다는 점이 강조된다.

Constrained Multi-Objective Optimization

논문에서는 PARP-1 inhibitor의 constrained multi-objective optimization을 case study로 제시한다. PARP-1은 base excision repair pathway에 관여하는 DNA repair enzyme으로, homologous recombination repair defect를 가진 cancer cell을 선택적으로 사멸시키는 표적이다. 그러나 대부분의 PARP-1 inhibitor는 poor aqueous solubility 문제를 가지고 있어, hydrophilic하면서도 potent한 inhibitor의 개발이 필요하다고 저자들은 이야기한다.

Initial molecule로는 approved drug Olaparib (IC50 = 0.9 nmol, solubility = 0.0601 mg/mL, logS = $-3.8$)이 선택된다. Optimization objective는 solubility, QED, SA이며, bioactivity motif constraint와 similarity constraint가 함께 적용된다. PSO는 cycle당 50 iteration으로 100회 반복하여 수행된다.

Figure 3. The information about the initial molecule, optimization process and representative optimized molecules. (A) The introduction of the initial molecule Olaparib and related optimization task set; (B) The averaged values of the final score, QED, LogS, SA and similarity during the 50-step optimization; (C) The docking score, final score and LogS value of Olaparib and 171 unique optimized molecules; (D) Five representative optimized molecules with their property information.

Figure 3B에서는 optimization 과정 동안 LogS와 final score가 초기에 급격히 상승한 후 안정 구간에 진입하는 양상이 관찰된다. QED는 0.70–0.72, SA는 2.5–2.7 사이에서 유지되며, similarity는 0.40 부근에서 안정된다. 논문에서는 bioactivity motif constraint로 인해 optimized molecule이 initial molecule에서 크게 벗어나지 않는 점을 structural constraint의 중요성을 보여주는 결과로 해석한다.

100 cycle의 optimization 후 171개의 unique optimized molecule이 생성되었다고 보고된다. Figure 3C는 이 분자들이 initial molecule 대비 높은 solubility와 final desirability score를 보임을 시각화하고 있다. 주요 구조 변화는 1,2-dihydrophthalazine을 1H-1,2,4-triazole, piperazine, imidazolidine (N,N-dimethylacetamide 결합)과 같은 보다 polar한 functional group으로 치환한 것이며, 저자들은 이러한 치환이 hydration을 강화하고 dissolution의 thermodynamic process를 촉진한다고 설명하고 있다.

Molecular docking 분석 결과 대부분의 optimized molecule이 높은 docking score를 보였으며, 이는 bioactivity motif와 similarity threshold 적용의 효과로 해석된다. Figure 3D에서는 5개의 representative optimized molecule이 제시되며, 모두 Olaparib 및 approved PARP-1 drug와 높은 structural similarity를 보인다.

Figure 4. The interactions between the generated active molecules with PARP1. Figures A-F correspond to the binding mode of Olaparib and the optimized molecules 1~5, respectively. The hydrogen-bonding, π-π stacking interactions and charged interactions are indicated by different colors.

Figure 4는 PARP-1 (PDB ID: 4L6S) target과 optimized molecule의 binding mode를 보여준다. 논문에서는 대부분의 optimized molecule이 Gly863, Ser904와의 H-bonding network 및 Tyr907과의 $\pi$-$\pi$ stacking이라는 PARP-1 inhibitor의 key interaction을 유지한다고 보고한다.

추가 검증을 위해 저자들은 molecular dynamics (MD) simulation을 수행한다. AMBER ff19SB force field와 General AMBER Force Field 2를 사용하여 20 ns production MD simulation을 진행하였고, 모든 system이 simulation 동안 안정적이었다고 이야기한다. Optimized molecule 3 ($-56.772$ kcal/mol), 4 ($-58.949$ kcal/mol), 5 ($-58.831$ kcal/mol)의 MM/GBSA binding free energy가 Olaparib ($-52.130$ kcal/mol)과 유사하거나 더 우수한 수준으로 보고된다.

Figure 5. The top 10 residues contributing to the binding free energies of the generated active molecules. Figures A-F correspond to the binding modes of Olaparib and the optimized molecules 1~5, respectively.

Figure 5는 residue energy decomposition 결과를 제시한다. 논문에서는 Tyr896, Tyr907, His862가 대부분의 optimized molecule에서 binding free energy에 가장 크게 기여하는 residue로 나타났고, 이는 PARP-1 inhibitor에서 알려진 residue contribution pattern과 일치한다고 설명하고 있다. 저자들은 이러한 결과가 privileged motif와 similarity constraint의 합리적 적용이 lead optimization 과정에서 bioactivity 유지에 필수적임을 보여준다고 해석한다.

Limitation

논문에서는 ChemMORT의 한계점에 대한 명시적 논의는 충분히 다루지 않으나, 본문 흐름에서 다음과 같은 검토 지점이 드러난다.

Decoder가 character-by-character 방식으로 SMILES를 생성하기 때문에, latent vector의 임의 조합이 invalid 또는 chemically infeasible한 molecule을 출력할 가능성이 존재한다고 저자들이 직접 언급하고 있다.
Scoring scheme이 user-defined weight와 recommended value range에 크게 의존하므로, weight 설정에 따라 optimization 결과가 달라질 수 있다는 점은 추가 검증이 필요하다고 볼 수 있다.
PARP-1 case study 외에 다양한 target class에 대한 generalization 검증은 제한적이므로, ChemMORT의 범용성에 대한 추가 실험이 부족하다고 볼 수 있다.
Optimized molecule의 실제 합성 가능성과 in vitro/in vivo bioactivity는 docking과 MD simulation의 in silico 결과로만 뒷받침되며, 실험적 검증은 본 논문 범위에 포함되지 않는다.

Conclusion

논문에서는 ChemMORT가 NMT 기반 molecular representation 학습, credible ADMET prediction model, multi-objective PSO strategy를 통합하여 lead molecule의 ADMET property를 multi-objective로 optimization하는 web-based platform이라고 정리한다. SMILES Encoder, Descriptor Decoder, Molecular Optimizer의 세 모듈이 representation, translation, optimization 기능을 각각 담당한다.

PARP-1 inhibitor case study에서는 Olaparib을 기반으로 한 constrained multi-objective optimization을 통해, target binding affinity를 보존하면서 ADMET property가 개선된 171개의 unique molecule이 생성되었다고 보고한다. 저자들은 ChemMORT의 합리적 활용을 통해 향후 improved ADMET profile을 가진 potent drug candidate 발견이 가능하다고 전망한다.