FEP augmentation as a means to solve data paucity problems for machine learning in chemical biology

dc.contributor.author: Burger, Pieter B.
dc.contributor.author: Hu, Xiaohu
dc.contributor.author: Balabin, Ilya
dc.contributor.author: Muller, Morne
dc.contributor.author: Stanley, Megan
dc.contributor.author: Joubert, Fourie
dc.contributor.author: Kaiser, Thomas M.
dc.date.accessioned: 2025-03-28T04:22:58Z
dc.date.available: 2025-03-28T04:22:58Z
dc.date.issued: 2024-04-23
dc.description [en_US]: DATA AVAILABILITY STATEMENT: All software generated for this paper is available in the Supporting Information. The KNIME analytics platform can be downloaded for free at https://www.knime.com/. All KNIME workflows are provided within the Supporting Information. All necessary data to replicate the study can be found in the public domain or within the provided Supporting Information.
dc.description [en_US]: SUPPORTING INFORMATION: Comprehensive description of the methodologies and parameters employed; list of the chemicals involved in this research; outcomes for each FEP calculation; MD reports; workflow of the ML experiments, including the corresponding initial data; and ML performance at two additional categorical cutoff values (PDF). MD reports (ZIP). Input structure data (ZIP). FEPML workflows (ZIP). FEPML results (ZIP). Compound list (ZIP). SMILES (CSV). Processing Data Workflow (ZIP).
dc.description.abstract [en_US]: In the realm of medicinal chemistry, the primary objective is to swiftly optimize a multitude of chemical properties of a set of compounds to yield a clinical candidate poised for clinical trials. In recent years, two computational techniques, machine learning (ML) and physics-based methods, have evolved substantially and are now frequently incorporated into the medicinal chemist’s toolbox to enhance the efficiency of both hit optimization and candidate design. Both computational methods come with their own set of limitations, and they are often used independently of each other. ML’s capability to screen extensive compound libraries expediently is tempered by its reliance on quality data, which can be scarce especially during early-stage optimization. Contrarily, physics-based approaches like free energy perturbation (FEP) are frequently constrained by low throughput and high cost by comparison; however, physics-based methods are capable of making highly accurate binding affinity predictions. In this study, we harnessed the strength of FEP to overcome data paucity in ML by generating virtual activity data sets which then inform the training of algorithms. Here, we show that ML algorithms trained with an FEP-augmented data set could achieve comparable predictive accuracy to data sets trained on experimental data from biological assays. Throughout the paper, we emphasize key mechanistic considerations that must be taken into account when aiming to augment data sets and lay the groundwork for successful implementation. Ultimately, the study advocates for the synergy of physics-based methods and ML to expedite the lead optimization process. We believe that the physics-based augmentation of ML will significantly benefit drug discovery, as these techniques continue to evolve.
dc.description.department [en_US]: Biochemistry, Genetics and Microbiology (BGM)
dc.description.librarian [en_US]: am2024
dc.description.sdg [en_US]: SDG-09: Industry, innovation and infrastructure
dc.description.sponsorship [en_US]: Avicenna Biosciences, Inc.
dc.description.uri [en_US]: https://pubs.acs.org/journal/jcisd8
dc.identifier.citation [en_US]: Burger, P.B., Hu, X., Balabin, I. et al. 2024, 'FEP augmentation as a means to solve data paucity problems for machine learning in chemical biology', Journal of Chemical Information and Modeling, vol. 64, no. 9, pp. 3812–3825, doi: 10.1021/acs.jcim.4c00071.
dc.identifier.issn: 1549-9596 (print)
dc.identifier.issn: 1549-960X (online)
dc.identifier.other: 10.1021/acs.jcim.4c00071
dc.identifier.uri: http://hdl.handle.net/2263/101771
dc.language.iso [en_US]: en
dc.publisher [en_US]: American Chemical Society
dc.rights [en_US]: © 2024 The Authors. This article is licensed under CC-BY-NC-ND 4.0.
dc.subject [en_US]: Medicinal chemistry
dc.subject [en_US]: Clinical trials
dc.subject [en_US]: Free energy perturbation (FEP)
dc.subject [en_US]: Machine learning (ML)
dc.subject [en_US]: Physics-based methods
dc.subject [en_US]: SDG-09: Industry, innovation and infrastructure
dc.title [en_US]: FEP augmentation as a means to solve data paucity problems for machine learning in chemical biology
dc.type [en_US]: Article

Files

Original bundle

Now showing 1 - 5 of 9
Name: Burger_FEP_2024.pdf
Size: 5.69 MB
Format: Adobe Portable Document Format
Description: Article

Name: Burger_FEPSuppl1_2024.pdf
Size: 589.45 KB
Format: Adobe Portable Document Format
Description: Supplementary Material 1

Name: Burger_FEPSuppl2_2024.zip
Size: 13.95 MB
Format: Unknown data format
Description: Supplementary Material 2

Name: Burger_FEPSuppl3_2024.zip
Size: 1.24 MB
Format: Unknown data format
Description: Supplementary Material 3

Name: Burger_FEPSuppl4_2024.zip
Size: 1.06 MB
Format: Unknown data format
Description: Supplementary Material 4

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission