Synthemol¶
Note
- Before starting training and evaluation, download the experiment dataset Data.zip and set data_dir in the yaml configuration file to the path of the decompressed dataset, for example "./data/Data/...". Also download resources.zip and unzip it to examples/synthemol/synthemol/.
- To evaluate with a pre-trained model, download pretrained.zip and unzip it, for example to ./pretrained/pretrained_chemprop.pdparams, then set that path in PRE_COMPUTE.model_path of the yaml configuration file.
- Before starting training and generation, install rdkit and the other dependencies by running pip install -r requirements.txt.
# Download pre-trained model (optional, or specify your own trained model in configuration file)
mkdir -p ./pretrained && wget -O ./pretrained/pretrained_chemprop.pdparams https://paddle-org.bj.bcebos.com/paddlescience/models/synthemol/pretrained_chemprop.pdparams
# Evaluate the Chemprop property prediction model on the antibiotics and other data
# Configuration can be modified in conf/synthemol.yaml
python main.py mode=eval
# Download pre-trained model (optional, or specify your own trained model in configuration file)
mkdir -p ./pretrained && wget -O ./pretrained/pretrained_chemprop.pdparams https://paddle-org.bj.bcebos.com/paddlescience/models/synthemol/pretrained_chemprop.pdparams
# Use trained model to score and compute building blocks to accelerate the next generation phase
# Configuration can be modified in conf/synthemol.yaml
python main.py mode=pre-compute
1. Background Introduction¶
The rapid emergence of pan-drug-resistant bacteria makes the development of structurally novel antibiotics urgent. Artificial intelligence can help discover new antibiotics, but existing methods have clear limitations: property prediction models must evaluate molecules one by one, which scales poorly to very large chemical spaces, while generative models explore such spaces quickly but often output molecules that are difficult to synthesize. To address this, the authors proposed SyntheMol, a generative model that designs new, easily synthesizable compounds from a chemical space of nearly 30 billion molecules. They applied SyntheMol to design molecules that inhibit the growth of Acinetobacter baumannii (a difficult-to-treat Gram-negative pathogen), synthesized 58 of the generated molecules, and tested them experimentally; 6 structurally novel molecules showed antibacterial activity against Acinetobacter baumannii and other phylogenetically diverse bacteria. The study demonstrates, with experimental validation, the potential of generative AI to design structurally novel, synthesizable, and effective small-molecule antibiotic candidates in a very large chemical space.
2. Synthemol Principle¶
This chapter only briefly introduces the model principle of Synthemol. For detailed theoretical derivation, please read Generative AI for designing and validating easily synthesizable and structurally novel antibiotics.
2.1 Property Predictor¶
Chemprop is a molecular property prediction model that uses a directed message passing neural network to process molecules and predict their properties. Chemprop first extracts simple atom and bond features (such as atom type and bond type) from the molecular graph to construct a feature vector for each atom and bond. The model then performs three rounds of message passing; in each round, a neural network layer fuses information from neighboring atoms and bonds. After message passing, Chemprop sums all fused feature vectors into a single feature vector representing the entire molecule. This vector is fed into a two-layer feedforward neural network to predict molecular properties; in this study, the output is the probability that the molecule inhibits the growth of Acinetobacter baumannii. This example is based on Chemprop v1.5.2, ported from its PyTorch v1.12.0.post2 implementation. For the other two property predictors, please refer to the original paper.
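The steps described above (initial features, three rounds of directed message passing, sum readout, two-layer feedforward network) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the Chemprop or PaddleScience implementation: the function name `dmpnn_predict`, the weight names, the feature sizes, and the random weights are all hypothetical, and bias terms are omitted for brevity.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dmpnn_predict(atom_feats, bonds, bond_feats, params, depth=3):
    """Toy directed message passing; returns a probability in (0, 1).

    atom_feats: (n_atoms, d_atom); bonds: list of undirected (u, v) pairs;
    bond_feats: (n_bonds, d_bond). All weights in `params` are hypothetical.
    """
    # Every undirected bond becomes two directed edges: u->v and v->u.
    edges = [(u, v) for u, v in bonds] + [(v, u) for u, v in bonds]
    edge_feats = np.concatenate([bond_feats, bond_feats], axis=0)
    src = np.array([s for s, _ in edges])

    # Initial hidden state of edge v->w from features of atom v and the bond.
    h0 = relu(np.concatenate([atom_feats[src], edge_feats], axis=1) @ params["W_i"])
    h = h0
    for _ in range(depth):
        msg = np.zeros_like(h)
        for i, (v, w) in enumerate(edges):
            # Sum hidden states of edges k->v, excluding the reverse edge w->v.
            for j, (k, t) in enumerate(edges):
                if t == v and k != w:
                    msg[i] += h[j]
        h = relu(h0 + msg @ params["W_h"])

    # Per-atom vector: atom features plus the sum of incoming edge states.
    atom_msg = np.zeros((atom_feats.shape[0], h.shape[1]))
    for j, (_, t) in enumerate(edges):
        atom_msg[t] += h[j]
    atom_h = relu(np.concatenate([atom_feats, atom_msg], axis=1) @ params["W_o"])

    mol_vec = atom_h.sum(axis=0)                 # sum readout over atoms
    hidden = relu(mol_vec @ params["W_1"])       # feedforward layer 1
    return float(sigmoid(hidden @ params["W_2"]))  # feedforward layer 2


# Toy 3-atom chain 0-1-2 with random features (no real chemistry involved).
rng = np.random.default_rng(0)
d_atom, d_bond, d_hidden = 4, 2, 8
params = {
    "W_i": rng.normal(scale=0.1, size=(d_atom + d_bond, d_hidden)),
    "W_h": rng.normal(scale=0.1, size=(d_hidden, d_hidden)),
    "W_o": rng.normal(scale=0.1, size=(d_atom + d_hidden, d_hidden)),
    "W_1": rng.normal(scale=0.1, size=(d_hidden, d_hidden)),
    "W_2": rng.normal(scale=0.1, size=(d_hidden,)),
}
atom_feats = rng.normal(size=(3, d_atom))
bond_feats = rng.normal(size=(2, d_bond))
prob = dmpnn_predict(atom_feats, [(0, 1), (1, 2)], bond_feats, params)
print(prob)
```

Excluding the reverse edge when aggregating messages is what makes the scheme "directed": information sent from atom w to atom v is not immediately echoed back to w.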
2.2 Synthemol¶
SyntheMol is a generative model that explores a combinatorial chemical space composed of molecules generated by chemical reactions of molecular building blocks to find molecules with target properties. SyntheMol uses a Monte Carlo Tree Search (MCTS) algorithm similar to AlphaGo to efficiently search for ideal molecules in this chemical space. SyntheMol can not only quickly identify promising molecules, but also give their synthesis routes (that is, the complete steps of combining molecular building blocks through a series of one-step or multi-step chemical reactions). Below, we give the mathematical symbols required to describe the SyntheMol MCTS algorithm and provide the corresponding pseudocode.
SyntheMol MCTS Algorithm¶
Requires:
- Synthesis tree T
- Property prediction model M
- Maximum number of rollouts n_rollout
- Maximum number of reactions n_reaction
function MCTS():
    for i = 1 to n_rollout do:
        rollout(T.root)
    end for
    return all visited nodes in T with 1 molecule and ≥ 1 reaction

function rollout(N):
    if node N has undergone ≥ n_reaction reactions then
        return property prediction score of M applied to molecules in N
    end if
    E ← expand_node(N)
    S ← select child node in E with largest MCTS score
    return rollout(S)

function expand_node(N):
    E ← empty set of nodes
    foreach reaction R do
        if R is compatible with molecules in N then
            Add new node to E with each product of R applied to molecules in N
        end if
    end for
    foreach building block B do
        if any reaction is compatible with B and molecules in N then
            Add new node to E with B and molecules in N
        end if
    end for
    return E
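The pseudocode can be made concrete with a small, self-contained Python sketch. Everything chemical here is a toy assumption: building blocks are the letters A, B, C, the single "reaction" concatenates two molecules, and `model_score` stands in for the property predictor M. The pseudocode does not define the MCTS score or how rollout results are recorded, so the PUCT-style score in `mcts_score` and the statistics update at the end of `rollout` are assumptions added for illustration.

```python
import math
from dataclasses import dataclass

BUILDING_BLOCKS = ("A", "B", "C")  # toy stand-ins for real building blocks
N_REACTION = 1                     # maximum number of reactions
EXPLORE_WEIGHT = 1.0               # hypothetical exploration constant


def model_score(molecule: str) -> float:
    """Toy property predictor M: fraction of 'A' units in the molecule."""
    return molecule.count("A") / len(molecule)


@dataclass
class Node:
    molecules: tuple      # molecules held by this node
    num_reactions: int    # reactions applied so far
    visits: int = 0       # rollouts that passed through this node
    total: float = 0.0    # sum of rollout scores seen at this node

    @property
    def prior(self) -> float:
        """Average predicted score of the molecules in this node."""
        if not self.molecules:
            return 0.0
        return sum(model_score(m) for m in self.molecules) / len(self.molecules)


def expand_node(node, tree):
    keys = []
    # Toy reaction: compatible only with exactly two molecules, one product.
    if len(node.molecules) == 2:
        keys.append((("".join(node.molecules),), node.num_reactions + 1))
    # A building block may be added while the reaction still lacks a reactant.
    if len(node.molecules) < 2:
        for block in BUILDING_BLOCKS:
            keys.append((node.molecules + (block,), node.num_reactions))
    # Reuse existing nodes so that visit statistics persist across rollouts.
    return [tree.setdefault(key, Node(*key)) for key in keys]


def mcts_score(node, rollouts_done):
    # Assumed PUCT-style score: average rollout score plus a prior-weighted
    # exploration bonus that shrinks as the node is visited more often.
    exploit = node.total / node.visits if node.visits else 0.0
    explore = EXPLORE_WEIGHT * node.prior * math.sqrt(1 + rollouts_done) / (1 + node.visits)
    return exploit + explore


def rollout(node, tree, rollouts_done):
    if node.num_reactions >= N_REACTION:
        score = node.prior
    else:
        children = expand_node(node, tree)
        best = max(children, key=lambda c: mcts_score(c, rollouts_done))
        score = rollout(best, tree, rollouts_done)
    # Assumed bookkeeping step: record the result on the way back up.
    node.visits += 1
    node.total += score
    return score


def mcts(n_rollout=50):
    root = Node((), 0)
    tree = {((), 0): root}
    for i in range(n_rollout):
        rollout(root, tree, i)
    return [n for n in tree.values()
            if n.visits > 0 and len(n.molecules) == 1 and n.num_reactions >= 1]


results = mcts(n_rollout=50)
for node in sorted(results, key=lambda n: -n.prior):
    print(node.molecules[0], round(node.prior, 2), node.visits)
```

Keying the tree by (molecules, number of reactions) lets the same node be reached again in later rollouts, which is what allows the score to balance exploitation of good branches against exploration of rarely visited ones.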
3. Synthemol Model Implementation¶
Next, we explain how to implement Chemprop training, building-block score pre-computation, and molecule generation for Synthemol with PaddleScience. For other details in this case, please refer to the API Documentation.
3.1 Dataset Introduction¶
The dataset uses the Data.zip dataset from the author's repository Synthemol.
The training set consists of 3 compound libraries:
- Library 1 contains 2371 molecules from the Pharmakon-1760 library (containing 1360 FDA-approved drugs and 400 internationally approved drugs) and 800 natural products isolated from plant, animal and microbial sources.
- Library 2 is the Broad Drug Repurposing Hub, containing 6680 molecules, most of which are FDA-approved drugs or clinical candidate compounds.
- Library 3 is a small molecule synthesis screening library containing 5376 molecules, randomly sampled from a larger compound library of the Broad Institute.
All 3 libraries were screened in biological duplicate for growth inhibition activity against Acinetobacter baumannii ATCC 17978. The experimental process is as follows:
- Culture the strain overnight in 2 ml LB medium at 37 °C, then dilute it 1:10 000 in fresh LB.
- Transfer 49.5 µl (384-well plate) or 99 µl (96-well plate) of the bacterial suspension to a Corning flat-bottom microplate, either manually or with an Agilent Bravo pipetting system.
- Add the test compound to each well at a final concentration of 50 µM, for a final volume of 50 µl (384-well plate) or 100 µl (96-well plate).
- Incubate at 37 °C for 16 h.
- Read absorbance at 600 nm with a SpectraMax M3 microplate reader (Molecular Devices), normalize the data by the interquartile mean of each plate, then aggregate the replicates and determine positive hits.
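The normalization in the last step can be illustrated with a short NumPy sketch. It assumes the plate statistic is an interquartile mean, that is, the mean of the readings lying between the 25th and 75th percentiles of the plate, and that the two replicates are aggregated by averaging. The readings and the function names are made up for illustration; the hit threshold is not specified here and is left out.

```python
import numpy as np


def interquartile_mean(values):
    """Mean of the values between the 25th and 75th percentiles (inclusive)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    middle = values[(values >= q1) & (values <= q3)]
    return middle.mean()


def normalize_plate(od600):
    """Divide every well by the plate's interquartile mean."""
    od600 = np.asarray(od600, dtype=float)
    return od600 / interquartile_mean(od600)


# Made-up OD600 readings for one 8-well strip, measured in two replicates.
# Well 3 holds a compound that suppresses growth.
replicate_1 = [0.98, 1.02, 1.00, 0.10, 1.01, 0.99, 1.03, 0.97]
replicate_2 = [1.01, 0.99, 1.02, 0.12, 0.98, 1.00, 0.97, 1.03]

# Assumed aggregation: average the normalized values of the two replicates.
normalized = (normalize_plate(replicate_1) + normalize_plate(replicate_2)) / 2
print(np.round(normalized, 3))
```

Because the interquartile mean ignores the lowest and highest quarter of wells, a few strongly inhibited wells do not shift the plate baseline.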
For more details, including the hyperparameter search space for each model, please refer to the authors' original paper. The hyperparameters used in this repository are preset in the yaml configuration file and can be adjusted as needed.
3.2 Chemprop Model Training¶
3.2.1 Constraint Construction¶
This case is data-driven, so the supervised constraint is built with PaddleScience's built-in SupervisedConstraint. Before defining the constraint, first specify the parameters used for data loading.
The code for data loading is as follows:
Here, the "dataset" field sets the Dataset class name to MoleculeDatasetIter, and num_workers is 1.
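As a rough sketch of the shape of such a data loading configuration, the dictionary below contains only the two settings named above. Any further keys that the real configuration in examples/synthemol/main.py passes to the dataset (for example data paths) are not shown here and would need to be taken from that file.

```python
# Minimal sketch of a data loading configuration; only the two settings
# described in the text are included, all other keys are intentionally omitted.
train_dataloader_cfg = {
    "dataset": {
        "name": "MoleculeDatasetIter",  # Dataset class name
    },
    "num_workers": 1,
}
print(train_dataloader_cfg)
```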
The code for defining supervised constraints is as follows:
| examples/synthemol/main.py | |
|---|---|
The first parameter of SupervisedConstraint is the data loading method, here train_dataloader_cfg defined above is used;
The second parameter is the loss function. A custom loss function is used here; it is selected by the arguments passed to the get_loss_func function, and the Chemprop model in the paper uses CrossEntropyLoss;
The third parameter is the name of the constraint condition, which is convenient for subsequent indexing. Here it is named Sup.
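To make the loss concrete: assuming a single binary activity label per molecule, cross-entropy reduces to binary cross-entropy on the model's raw output (logit). The NumPy sketch below illustrates that formula only; it is not the get_loss_func from the example, and the function name `binary_cross_entropy` is hypothetical.

```python
import math

import numpy as np


def binary_cross_entropy(logits, labels):
    """Mean binary cross-entropy computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(x).
    """
    x = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    loss = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(loss.mean())


# A confident correct prediction is penalized far less than a confident wrong one.
good = binary_cross_entropy([4.0, -4.0], [1.0, 0.0])
bad = binary_cross_entropy([-4.0, 4.0], [1.0, 0.0])
print(good, bad)
```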
3.2.2 Model Construction¶
In this case, the molecular property prediction model is implemented based on the Chemprop network model, expressed in PaddleScience code as follows:
| examples/synthemol/main.py | |
|---|---|
The parameters of the network model are set through the configuration file as follows:
| examples/synthemol/conf/synthemol.yaml | |
|---|---|
Among them, input_keys and output_keys represent the names of the input and output variables of the network model respectively.
3.2.3 Learning Rate and Optimizer Construction¶
The learning rate in this case is set to 0.0001. The optimizer is Adam with the parameters split into groups, expressed in PaddleScience code as follows:
| examples/synthemol/main.py | |
|---|---|
3.2.4 Model Training¶
After completing the above settings, you only need to pass the instantiated objects to ppsci.solver.Solver in order, and then start training.
3.3 Pre-compute Building Block Scores¶
The code for constructing the model is:
| examples/synthemol/main.py | |
|---|---|
3.4 Synthemol Generate Molecules¶
The code for constructing Generator is:
4. Complete Code¶
| examples/synthemol/main.py | |
|---|---|
5. Result Display¶
First, evaluate the Chemprop model trained in the first step. Loading the pre-trained model and running the evaluation command gives the following results:
| | roc_auc | prc_auc |
|---|---|---|
| chemprop | 0.797 | 0.332 |
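For readers unfamiliar with the two metrics, the pure-Python sketch below shows how they can be computed on a toy set of labels and scores. It is an illustration under stated assumptions, not the evaluation code of the example: ROC-AUC is computed as the fraction of positive/negative pairs ranked correctly, and the area under the precision-recall curve is estimated with average precision, which may differ slightly from the estimator used in the repository. Tied scores are not handled specially in `average_precision`.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# Toy data: 1 = active molecule, 0 = inactive molecule.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(auc, ap)
```

When active molecules are rare, the precision-recall metric is usually much lower than ROC-AUC for the same model, because it is sensitive to how many inactive molecules are ranked above the actives.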
Checking the generated molecules.csv, you can see the generated molecular information similar to the table below:
| smiles | node_id | num_expansions | rollout_num | score | Q_value | num_reactions | reaction_1_id | building_block_1_1_id | building_block_1_1_smiles | building_block_1_2_id | building_block_1_2_smiles |
|---|---|---|---|---|---|---|---|---|---|---|---|
| C#CCN(C(=O)C(C)(C)C#C)C1CCN(C(=O)OC(C)(C)C)CC1 | 91431 | 20 | 1 | 1 | 22 | 4349560 | C#CCNC1CCN(C(=O)OC(C)(C)C)CC1 | 2998277 | C#CC(C)(C)C(=O)O |
The output lists each generated molecule together with the reaction and building blocks used to make it, which is consistent with the authors' design goal.