Suzuki-Miyaura Cross-Coupling Reaction Yield Prediction¶

Note

Before starting training and evaluation, please download the data file data_set.xlsx, and modify data_dir in the yaml configuration file to the actual file path of data_set.xlsx.
If you need to use a pre-trained model for evaluation, please download the pre-trained model smc_reac_model.pdparams, and modify load_model_path in the yaml configuration file to the model parameter path.
Before the first training and evaluation, please execute pip install -r requirements.txt to install rdkit and other related dependencies.

Model Training CommandModel Evaluation Command

# linux
wget -c https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx
# windows
# curl https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx -o data_set.xlsx
python smc_reac.py

# linux
wget -c https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx
# windows
# curl https://paddle-org.bj.bcebos.com/paddlescience/datasets/SMCReac/data_set.xlsx -o data_set.xlsx
python smc_reac.py mode=eval EVAL.pretrained_model_path=https://paddle-org.bj.bcebos.com/paddlescience/models/smc_reac/smc_reac_model.pdparams

1. Background Introduction¶

The Suzuki-Miyaura cross-coupling reaction expression is shown below.

\[ \mathrm{Ar{-}X} + \mathrm{Ar'{-}B(OH)_2} \xrightarrow[\text{Base}]{\mathrm{Pd}^0} \mathrm{Ar{-}Ar'} + \mathrm{HX} \]

Catalyzed by zero-valent palladium complexes, aryl or alkenyl boronic acids or boronic esters undergo cross-coupling with chloro-, bromo-, iodoarenes or alkenes. This reaction has the advantages of mild reaction conditions and high conversion rate, and plays an important role in fields such as material synthesis and drug development, but has problems such as long development cycle and high trial and error cost. This study establishes a prediction model by using high-throughput experimental data to analyze the effects of reaction substrates (including electrophiles and nucleophiles), catalytic ligands, bases, and solvents on the yield of coupling reactions.

2. Implementation of Suzuki-Miyaura Cross-Coupling Reaction Yield Prediction Model¶

This section will explain how to implement the construction, training, testing and evaluation of the Suzuki-Miyaura cross-coupling reaction yield prediction model based on PaddleScience code. The directory structure of the case is as follows.

smc_reac/
├──config/
│   └── smc_reac.yaml
├── data_set.xlsx
├── requirements.txt
└── smc_reac.py

2.1 Dataset Construction and Loading¶

The data used in this example comes from the open source data provided in reference [1], considering only the influence of reagents themselves on experimental results, and filtering out partial reaction data where reagents participated in all components, saved in the file ./data_set.xlsx. This work developed an automated platform based on flow chemistry, which was assembled in an argon-protected glove box, using a modified high-performance liquid chromatography (HPLC) system combined with an automated sampling device to draw reaction components (electrophiles, nucleophiles, catalysts, ligands, and bases) from 192 reservoirs according to a set program and inject them into the flow carrier liquid. Each reaction segment reacts in a temperature-controlled reaction coil at a set flow rate, pressure, and time, and the reaction solution is detected in real time by UPLC-MS. By regulating the combination of electrophiles, nucleophiles, 11 ligands, 7 bases, and 4 solvents, a systematic screening of 5760 reaction conditions was finally achieved. Next, taking one piece of data as an example, combined with code to illustrate the construction and loading process of the dataset.

ClC=1C=C2C=CC=NC2=CC1 | CC=1C(=C2C=NN(C2=CC1)C1OCCCC1)B(O)O | C(C)(C)(C)P(C(C)(C)C)C(C)(C)C | [OH-].[Na+] | C(C)#N | 4.76

Where SMILES are used to represent electrophiles, nucleophiles, catalytic ligands, bases, solvents, and experimental yields in turn.

First, import experimental material information and reaction yield from the table file, and divide the training set and test set,

examples/smc_reac/smc_reac.py
def load_data(cfg: DictConfig):
    data_dir = cfg.data_dir
    dataset = pd.read_excel(data_dir, skiprows=1)
    x = dataset.iloc[:, 1:6]
    y = dataset.iloc[:, 6]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )
    return x_train, x_test, y_train, y_test

Apply rdkit.Chem.rdFingerprintGenerator to convert the SMILES descriptions of electrophiles, nucleophiles, catalytic ligands, bases, and solvents into Morgan fingerprints. Morgan fingerprint is a vectorized description of molecular structure, encoded as hash values through local topology, and mapped to 2048-bit fingerprint bits. Expressed in PaddleScience code as follows

examples/smc_reac/smc_reac.py
def data_processed(x, y):
    x = build_dataset(x)
    y = paddle.to_tensor(y.to_numpy(dtype=np.float32))
    y = paddle.unsqueeze(y, axis=1)
    return x, y


def build_dataset(data):
    r1 = paddle.to_tensor(np.array(cal_print(data.iloc[:, 0])), dtype=paddle.float32)
    r2 = paddle.to_tensor(np.array(cal_print(data.iloc[:, 1])), dtype=paddle.float32)
    ligand = paddle.to_tensor(
        np.array(cal_print(data.iloc[:, 2])), dtype=paddle.float32
    )
    base = paddle.to_tensor(np.array(cal_print(data.iloc[:, 3])), dtype=paddle.float32)
    solvent = paddle.to_tensor(
        np.array(cal_print(data.iloc[:, 4])), dtype=paddle.float32
    )
    return paddle.concat([r1, r2, ligand, base, solvent], axis=1)


def cal_print(smiles):
    vectors = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
        fp = generator.GetFingerprint(mol)
        _input = np.array(list(map(float, fp.ToBitString())))
        vectors.append(_input)
    return vectors

2.2 Constraint Construction¶

This case uses supervised learning. According to the PaddleScience API structure description, the built-in SupervisedConstraint is used to construct supervised constraints. Expressed in PaddleScience code as follows

examples/smc_reac/smc_reac.py
sup = ppsci.constraint.SupervisedConstraint(
    dataloader_cfg={
        "dataset": {
            "input": {"v": x_train},
            "label": {"u": y_train},
            # "weight": {"W": param},
            "name": "IterableNamedArrayDataset",
        },
        "batch_size": cfg.TRAIN.batch_size,
    },
    loss=ppsci.loss.MSELoss("mean"),
    name="sup",
)
constraint = {
    "sup": sup,
}

The second parameter of SupervisedConstraint indicates using mean squared error MSELoss as the loss function, and the third parameter indicates the name of the constraint condition, which is convenient for subsequent indexing.

2.3 Model Construction¶

This case designed five independent sub-networks (fully connected layer + ReLU activation), each sub-network extracts features of corresponding chemical substances respectively. Subsequently, these five feature vectors are weighted averaged through trainable weight parameters to achieve adaptive learning of the impact of different chemical components on reaction yield prediction. Finally, the fused features are input into a fully connected layer for further mapping to output the predicted value of reaction yield. The entire network structure reflects the independent extraction and weighted fusion of information of each component in the reaction, consistent with the characteristics of the reaction mechanism. Expressed in PaddleScience code as follows

ppsci/arch/smc_reac.py
class SuzukiMiyauraModel(base.Arch):
    def __init__(
        self, input_dim, hidden_dim, hidden_dim2, hidden_dim3, hidden_dim4, output_dim
    ):
        super().__init__()

        self.r1_fc = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim2),
            nn.ReLU(),
            nn.Linear(hidden_dim2, hidden_dim3),
        )

        self.r2_fc = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim2),
            nn.ReLU(),
            nn.Linear(hidden_dim2, hidden_dim3),
        )

        self.ligand_fc = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim2),
            nn.ReLU(),
            nn.Linear(hidden_dim2, hidden_dim3),
        )

        self.base_fc = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim2),
            nn.ReLU(),
            nn.Linear(hidden_dim2, hidden_dim3),
        )

        self.solvent_fc = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim2),
            nn.ReLU(),
            nn.Linear(hidden_dim2, hidden_dim3),
            nn.ReLU(),
        )

        self.weights = paddle.create_parameter(
            shape=[5],
            dtype="float32",
            default_initializer=paddle.nn.initializer.Assign(
                paddle.to_tensor([0.2, 0.2, 0.2, 0.2, 0.2])
            ),
        )

        self.fc_combined = nn.Sequential(
            nn.Linear(hidden_dim3, hidden_dim4),
            nn.ReLU(),
            nn.Linear(hidden_dim4, output_dim),
        )

    def weighted_average(self, features, weights):

        weights = weights.clone().detach()

        weighted_sum = sum(f * w for f, w in zip(features, weights))

        total_weight = weights.sum()

        return weighted_sum / total_weight

    def forward(self, x):
        x = self.concat_to_tensor(x, ("v"), axis=-1)

        input_splits = paddle.split(x, num_or_sections=5, axis=1)

        r1_input, r2_input, ligand_input, base_input, solvent_input = input_splits

        r1_features = self.r1_fc(r1_input)

        r2_features = self.r2_fc(r2_input)

        ligand_features = self.ligand_fc(ligand_input)

        base_features = self.base_fc(base_input)

        solvent_features = self.solvent_fc(solvent_input)

        features = [
            r1_features,
            r2_features,
            ligand_features,
            base_features,
            solvent_features,
        ]

        combined_features = self.weighted_average(features, self.weights)

        output = self.fc_combined(combined_features)
        output = self.split_to_dict(output, ("u"), axis=-1)
        return output

The model is instantiated according to the configuration file information

examples/smc_reac/smc_reac.py
model = ppsci.arch.SuzukiMiyauraModel(**cfg.MODEL)

Parameters are set through the configuration file as follows

examples/smc_reac/config/smc_reac.yaml
MODEL: #
  input_dim : 2048  # Assuming x_train is your DataFrame
  output_dim : 1
  hidden_dim : 512
  hidden_dim2 : 1024
  hidden_dim3 : 2048
  hidden_dim4 : 1024

2.4 Optimizer Construction¶

The trainer uses the Adam optimizer, and the learning rate setting is given by the configuration file. Expressed in PaddleScience code as follows

examples/smc_reac/smc_reac.py
optimizer = ppsci.optimizer.optimizer.Adam(cfg.TRAIN.learning_rate)(model)

2.5 Model Training¶

After completing the above settings, you only need to pass the instantiated objects to ppsci.solver.Solver in order, and then start training. Expressed in PaddleScience code as follows

examples/smc_reac/smc_reac.py
solver = ppsci.solver.Solver(
    model,
    constraint=constraint,
    optimizer=optimizer,
    epochs=cfg.TRAIN.epochs,
    eval_during_train=False,
    iters_per_epoch=cfg.TRAIN.iters_per_epoch,
    cfg=cfg,
)
solver.train()

3. Complete Code¶

examples/smc_reac/smc_reac.py
import os

import hydra
import matplotlib.pyplot as plt
import numpy as np
import paddle
import pandas as pd
import rdkit.Chem as Chem
from omegaconf import DictConfig
from rdkit.Chem import rdFingerprintGenerator
from sklearn.model_selection import train_test_split

import ppsci

os.environ["HYDRA_FULL_ERROR"] = "1"
os.environ["KMP_DUPLICATE_LIB_OK"] = "True"
plt.rcParams["axes.unicode_minus"] = False
plt.rcParams["font.sans-serif"] = ["DejaVu Sans"]

x_train = None
x_test = None
y_train = None
y_test = None


def load_data(cfg: DictConfig):
    data_dir = cfg.data_dir
    dataset = pd.read_excel(data_dir, skiprows=1)
    x = dataset.iloc[:, 1:6]
    y = dataset.iloc[:, 6]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )
    return x_train, x_test, y_train, y_test


def data_processed(x, y):
    x = build_dataset(x)
    y = paddle.to_tensor(y.to_numpy(dtype=np.float32))
    y = paddle.unsqueeze(y, axis=1)
    return x, y


def build_dataset(data):
    r1 = paddle.to_tensor(np.array(cal_print(data.iloc[:, 0])), dtype=paddle.float32)
    r2 = paddle.to_tensor(np.array(cal_print(data.iloc[:, 1])), dtype=paddle.float32)
    ligand = paddle.to_tensor(
        np.array(cal_print(data.iloc[:, 2])), dtype=paddle.float32
    )
    base = paddle.to_tensor(np.array(cal_print(data.iloc[:, 3])), dtype=paddle.float32)
    solvent = paddle.to_tensor(
        np.array(cal_print(data.iloc[:, 4])), dtype=paddle.float32
    )
    return paddle.concat([r1, r2, ligand, base, solvent], axis=1)


def cal_print(smiles):
    vectors = []
    for smi in smiles:
        mol = Chem.MolFromSmiles(smi)
        generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
        fp = generator.GetFingerprint(mol)
        _input = np.array(list(map(float, fp.ToBitString())))
        vectors.append(_input)
    return vectors


def train(cfg: DictConfig):
    global x_train, y_train
    x_train, y_train = data_processed(x_train, y_train)

    # build supervised constraint
    sup = ppsci.constraint.SupervisedConstraint(
        dataloader_cfg={
            "dataset": {
                "input": {"v": x_train},
                "label": {"u": y_train},
                # "weight": {"W": param},
                "name": "IterableNamedArrayDataset",
            },
            "batch_size": cfg.TRAIN.batch_size,
        },
        loss=ppsci.loss.MSELoss("mean"),
        name="sup",
    )
    constraint = {
        "sup": sup,
    }

    model = ppsci.arch.SuzukiMiyauraModel(**cfg.MODEL)

    optimizer = ppsci.optimizer.optimizer.Adam(cfg.TRAIN.learning_rate)(model)

    # Build solver
    solver = ppsci.solver.Solver(
        model,
        constraint=constraint,
        optimizer=optimizer,
        epochs=cfg.TRAIN.epochs,
        eval_during_train=False,
        iters_per_epoch=cfg.TRAIN.iters_per_epoch,
        cfg=cfg,
    )
    solver.train()


def evaluate(cfg: DictConfig):
    global x_test, y_test

    x_test, y_test = data_processed(x_test, y_test)

    test_validator = ppsci.validate.SupervisedValidator(
        dataloader_cfg={
            "dataset": {
                "input": {"v": x_test},
                "label": {"u": y_test},
                "name": "IterableNamedArrayDataset",
            },
            "batch_size": cfg.EVAL.batch_size,
            "shuffle": False,
        },
        loss=ppsci.loss.MSELoss("mean"),
        metric={
            "MAE": ppsci.metric.MAE(),
            "RMSE": ppsci.metric.RMSE(),
            "R2": ppsci.metric.R2Score(),
        },
        name="test_eval",
    )
    validators = {"test_eval": test_validator}

    model = ppsci.arch.SuzukiMiyauraModel(**cfg.MODEL)
    solver = ppsci.solver.Solver(
        model,
        validator=validators,
        cfg=cfg,
    )

    loss_val, metric_dict = solver.eval()

    ypred = model({"v": x_test})["u"].numpy()
    ytrue = y_test.numpy()

    mae = metric_dict["MAE"]["u"]
    rmse = metric_dict["RMSE"]["u"]
    r2 = metric_dict["R2"]["u"]

    plt.figure()
    plt.scatter(ytrue, ypred, s=15, color="royalblue", marker="s", linewidth=1)
    plt.plot([ytrue.min(), ytrue.max()], [ytrue.min(), ytrue.max()], "r-", lw=1)
    plt.legend(title="R²={:.3f}\n\nMAE={:.3f}".format(r2, mae))
    plt.xlabel("Test Yield(%)")
    plt.ylabel("Predicted Yield(%)")
    save_path = "smc_reac.png"
    plt.savefig(save_path)
    print(f"Image saved to: {save_path}")
    plt.show()

    print("Evaluation metrics:")
    print(f"Loss: {loss_val:.4f}")
    print(f"MAE : {mae:.4f}")
    print(f"RMSE: {rmse:.4f}")
    print(f"R2  : {r2:.4f}")


@hydra.main(version_base=None, config_path="./config", config_name="smc_reac.yaml")
def main(cfg: DictConfig):
    global x_train, x_test, y_train, y_test

    x_train, x_test, y_train, y_test = load_data(cfg)

    if cfg.mode == "train":
        train(cfg)
    elif cfg.mode == "eval":
        evaluate(cfg)
    else:
        raise ValueError(f"cfg.mode should in ['train', 'eval'], but got '{cfg.mode}'")


if __name__ == "__main__":
    main()

4. Result Display¶

The figure below shows the model prediction results for the yield of the Suzuki-Miyaura cross-coupling reaction.

Model prediction results for Suzuki-Miyaura cross-coupling reaction yield

5. References¶

[1] Perera D, Tucker J W, Brahmbhatt S, et al. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow[J]. Science, 2018, 359(6374): 429-434.