Drug Screening Challenge:

Challenge Description
The traditional drug discovery process is expensive and time-consuming. Accelerating this
process is a significant challenge. An ML/AI-based approach could help by enabling
high-throughput virtual pre-screening of large numbers of drug candidates to identify
highly potent candidates for experimental testing and further validation.


The goal of this challenge is to build effective ML/AI-based surrogate models that can accurately
predict the docking scores of candidate drug molecules on SARS-CoV-2 protein targets.


The datasets to be used in this Challenge have been generated and curated by researchers at the
Brookhaven National Laboratory (BNL) in collaboration with other DOE (Department of Energy)
National Laboratories who have joined forces to combat the COVID-19 pandemic. The provided
datasets contain the docking scores of molecules (i.e., drug candidates) on SARS-CoV-2 protein
targets. Drug candidates are represented as SMILES strings and selected from known drug
databases such as ENAMINE [1], ZINC [2] and DrugBank [3]. All SMILES strings have been
canonicalized. The COVID-19 protein targets were provided by researchers at Argonne National
Laboratory (ANL). The docking scores were obtained from AutoDock 4.2 [4] and then collected
and organized into a CSV file, where rows represent molecules and columns represent different
docking targets.
The whole dataset includes docking scores of 300,457 molecules on 18 different COVID-19-related
protein docking targets. Part of this data will be provided for training and initial validation.
The rest will be held out for evaluating the performance of the models trained by the participants.
The training/validation dataset will include the SMILES strings of 270,000 molecules
and their docking scores against the different targets. In the test set, only the SMILES strings
will be provided for the remaining 30,457 molecules, without their docking scores. Participants
will need to train their own models to accurately predict the docking scores of these
molecules on the different targets. The dataset can be obtained from the GitHub repo [5] created
for the Challenge: https://github.com/BC3D/BC3D_2021
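
The CSV layout described above (rows are molecules, columns are docking targets) can be read with a short Python sketch such as the following. The column names ("smiles", "target_A", "target_B") and the scores in the sample are placeholders chosen for illustration; they are assumptions, not taken from the Challenge files, so check the actual header in the repo before reusing this.

```python
import csv
import io

# Hypothetical excerpt: the column names and scores below are placeholders,
# not values from the Challenge dataset.
SAMPLE_CSV = """smiles,target_A,target_B
CCO,-5.1,-4.3
c1ccccc1,-6.0,-5.2
"""


def load_docking_table(handle, smiles_column="smiles"):
    """Read a docking-score table from an open text handle.

    Returns (smiles, targets, scores), where scores[i][j] is the docking
    score of molecule i on target j. The name of the SMILES column is an
    assumption and can be overridden.
    """
    reader = csv.DictReader(handle)
    # Every column other than the SMILES column is treated as a target.
    targets = [name for name in reader.fieldnames if name != smiles_column]
    smiles, scores = [], []
    for row in reader:
        smiles.append(row[smiles_column])
        scores.append([float(row[t]) for t in targets])
    return smiles, targets, scores


smiles, targets, scores = load_docking_table(io.StringIO(SAMPLE_CSV))
```

For the real data, the same function could be called with an opened file, for example `open(path, newline="")`, where `path` points to the training/validation CSV downloaded from the repo.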

Evaluation criteria:

Participants should submit the docking scores predicted by their trained model for the
molecules in the test set. Submissions should use the same CSV format as the
training/validation dataset provided to the participants. The predicted scores will be
compared with the ground-truth docking scores, and model accuracy will be assessed by the
mean absolute error (MAE) averaged over all the targets.
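
As a rough illustration of the metric, the sketch below computes the MAE for each target and then takes the unweighted mean across targets. The unweighted averaging, the handling of inputs as per-target lists, and the function names are assumptions made for this example; the organizers' exact scoring script is not described here.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between two equal-length sequences of scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def averaged_mae(true_by_target, pred_by_target):
    """Return (average MAE over targets, per-target MAE dictionary).

    Both arguments map a target name to a list of docking scores, with
    molecules in the same order. Each target is weighted equally (assumption).
    """
    per_target = {
        name: mean_absolute_error(true_by_target[name], pred_by_target[name])
        for name in true_by_target
    }
    return sum(per_target.values()) / len(per_target), per_target


# Placeholder scores for two hypothetical targets and two molecules.
truth = {"target_A": [-5.0, -6.0], "target_B": [-4.0, -5.0]}
preds = {"target_A": [-5.5, -6.5], "target_B": [-4.0, -6.0]}

overall, per_target = averaged_mae(truth, preds)
```

In this toy case each target has an MAE of 0.5, so the averaged MAE is also 0.5. Lower values indicate predictions closer to the ground-truth docking scores.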


References & Resources:

[1] https://enamine.net/compound-libraries/diversity-libraries

[2] https://zinc.docking.org/

[3] https://go.drugbank.com/releases/latest

[4] The original docking score data used to create the training/validation/test datasets were obtained from AutoDock. Further information about AutoDock can be found at: http://autodock.scripps.edu/

[5] https://github.com/BC3D/BC3D_2021