Editorial

The impact of machine learning on future tuberculosis drug discovery

Pages 925-927 | Received 29 Apr 2022, Accepted 29 Jul 2022, Published online: 02 Aug 2022

1. Introduction

Tuberculosis (TB) is an infectious disease, mainly affecting the lungs, that is usually caused by the bacterium Mycobacterium tuberculosis. It is one of the leading infectious disease killers, claiming 1.5 million lives each year. Of the 10 million individuals who become ill with TB each year, ~30% are not identified by health systems. According to the US Centers for Disease Control in 2018, 1.7 billion people (23% of the world’s population) are infected with TB (https://www.cdc.gov/globalhealth/newsroom/topics/tb/index.html). Although effective drugs like rifamycin and isoniazid are available, resistance to anti-TB drugs due to drug misuse is a serious issue. There is a strong need to develop new, potent drugs to treat TB and prevent its spread.

Key proteins in TB have been identified, providing drug targets and experimental structures for rational drug design. High throughput screening of chemical libraries has provided resources for the design, discovery, or optimization of novel drugs to treat TB. However, target-based approaches often suffer from an inability to translate potent small-molecule protein modulators to effective human drugs.

The impressive performance of machine learning (ML) methods in many areas of science, technology, and medicine has led to a dramatic increase in their use to design or discover drugs to treat neglected tropical diseases [Citation1]. Because TB is a leading infectious disease killer and there are problems with efficacy and rapidly emerging resistance, finding new, effective, and safe anti-TB drugs is a high priority. ML methods are complementary to structural biology-based methods for the design of drugs to treat TB because they can be trained on a wide range of in vitro screening or in vivo and target-based data.

In drug design, ML represents a modern implementation of the quantitative structure–activity relationship (QSAR) paradigm developed in the 1960s by Hansch and Fujita. This posits a relationship between changes in drug structures and efficacy. They showed that mathematical or statistical models could capture this relationship quantitatively if the relevant physicochemical properties of the drugs could be encoded mathematically.

ML methods span a range of algorithms, from simple linear regression to complex deep learning methods. Linear models find simple relationships between the chemistry of molecules in the training set (used to generate the model) and biological activity. 3D QSAR models are a special case of linear models where the structural properties are generated from a grid of interaction points imposed on each molecule in the training set, and different types of interaction energies (steric, electrostatic, hydrophobic, hydrogen bond donor, or acceptor) are calculated at each grid point. The comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (COMSIA) algorithms are widely used to generate grid energies. However, these methods require molecules in the training set to have a common core that allows rational alignment in 3D. The use of 3D structures introduces complications because of the existence of multiple low energy conformations (shapes) for each molecule. Linear regression (e.g. multiple linear regression (MLR)) and 3D QSAR methods have the disadvantage that the relationship they are modeling may not be linear.
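
The limitation of linear models noted above can be illustrated with a minimal sketch. The descriptor values and activities below are hypothetical and are not taken from any study cited here; the code fits an ordinary least-squares line to one data set that is truly linear and to one that depends parabolically on the descriptor, then compares the r² values.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares give the coefficient of determination
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical, centred descriptor values (illustrative only)
descriptor = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Activity that really is linear in the descriptor
linear_activity = [2.0 * d + 5.0 for d in descriptor]

# Activity with an optimum at descriptor = 0 (a parabolic dependence)
parabolic_activity = [6.0 - d ** 2 for d in descriptor]

_, _, r2_linear = linear_fit(descriptor, linear_activity)
_, _, r2_parabolic = linear_fit(descriptor, parabolic_activity)

print(f"r2 for linear data:    {r2_linear:.2f}")
print(f"r2 for parabolic data: {r2_parabolic:.2f}")
```

In this toy case the straight line explains the linear data completely but captures none of the variance in the parabolic data, even though the activity there is fully determined by the descriptor.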

ML studies of anti-TB drugs generally fall into one of several types. The early studies employed mainly linear regression or 3D molecular field-based methods to map the relationships between molecular structure and biological activity. Of the papers on QSAR and ML studies of anti-TB drugs, over 35% employed 3D QSAR methods and over 50% reported models developed by MLR or 3D QSAR, both ML methods that assume a linear relationship between the chemical structures of the training set and the biological activities. These studies also often modeled relatively small numbers of anti-TB drug candidates, usually of a single chemotype. For example, Ragno et al. reported the synthesis and 3D QSAR modeling of 29 antimycobacterial pyrroles with minimum inhibitory concentration (MIC) values between 0.5 and 250 µg/ml [Citation2]. They used MLR, CoMFA, and a combination of both and found that MLR and combined models were better at predicting the MIC values of a six-compound test set than the CoMFA model.

Subsequently, Kumar and Siddiqi reported a 3D QSAR study of 37 structurally related arylamides as M. tuberculosis enoyl acyl carrier protein reductase inhibitors using CoMFA, COMSIA, and molecular docking [Citation3]. The IC50 values ranged from 90 nM to 39 µM. Both 3D QSAR models could accurately recapitulate the biological activity of the test set compounds, with r²pred = 0.88 and standard error of prediction = 0.24 for both models. The docking studies showed that the contribution of molecular properties to the 3D QSAR models was consistent with the observed binding mode of the arylamides to the enzyme active site (Figure 1).

Figure 1. Superimposition of 37 structurally related arylamides used in the 3D QSAR model bound to M. tuberculosis enoyl acyl carrier protein reductase. Reprinted by permission from Kumar and Siddiqi, Journal of Molecular Modeling. 2010 May;16(5):877–893 doi:10.1007/s00894-009-0584-0 [Citation3].
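
For readers unfamiliar with the predictive r² statistic quoted for test sets in 3D QSAR work, the following is a minimal sketch of one commonly used definition: one minus the ratio of the squared prediction errors on the test set to the squared deviations of the test set activities from the training set mean. Whether the cited study used exactly this form is an assumption, and all numbers below are hypothetical.

```python
def r2_pred(train_activities, test_observed, test_predicted):
    """Predictive r^2 for an external test set (one common definition).

    r2_pred = 1 - PRESS / SD, where
      PRESS = sum of squared (observed - predicted) over the test set
      SD    = sum of squared (observed - mean training activity) over the test set
    """
    mean_train = sum(train_activities) / len(train_activities)
    press = sum((o - p) ** 2 for o, p in zip(test_observed, test_predicted))
    sd = sum((o - mean_train) ** 2 for o in test_observed)
    return 1.0 - press / sd


# Hypothetical activities on a log scale (illustrative only)
train = [5.0, 6.0, 7.0]
test_obs = [5.0, 7.0, 8.0]
test_pred = [5.5, 6.5, 8.0]

print(f"r2_pred = {r2_pred(train, test_obs, test_pred):.3f}")
```

A value close to 1 means the model predicts the test compounds much better than simply guessing the mean activity of the training set.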

Despite the relatively small domains of applicability of these models, some authors used the models to virtually screen large databases of candidate anti-TB drugs. For example, Maganti et al. generated 3D QSAR models using a dataset of 80 diverse inhibitors of the aryl acid adenylating enzyme involved in siderophore biosynthesis in M. tuberculosis [Citation4]. They used the CoMFA and COMSIA models to virtually screen 2.3 million compounds from the ZINC database and identified 13 hits as novel prospective inhibitors of the enzyme. However, predictions so far outside the domains of applicability of the models will be unreliable.
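
One simple way to make the domain of applicability concrete is a nearest-neighbor similarity check, sketched below. This is a common heuristic, not the procedure used in the cited study. Fingerprints are represented as plain sets of integer feature identifiers, and the feature values, the function names, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def nearest_training_similarity(candidate, training_set):
    """Similarity of a candidate to its most similar training molecule."""
    return max(tanimoto(candidate, fp) for fp in training_set)


def in_domain(candidate, training_set, threshold=0.5):
    """Flag a candidate as inside the domain if it resembles a training molecule.

    The threshold is an arbitrary illustrative value, not a recommended cutoff.
    """
    return nearest_training_similarity(candidate, training_set) >= threshold


# Hypothetical fingerprints: each integer stands for a structural feature
training_set = [{1, 2, 3, 4}, {1, 2, 3, 5}]
similar_candidate = {1, 2, 3, 6}      # shares most features with the training set
dissimilar_candidate = {1, 7, 8, 9}   # shares almost nothing

for name, fp in [("similar", similar_candidate), ("dissimilar", dissimilar_candidate)]:
    sim = nearest_training_similarity(fp, training_set)
    print(f"{name}: nearest similarity = {sim:.2f}, in domain = {in_domain(fp, training_set)}")
```

A model trained on a few dozen closely related compounds would flag the great majority of a multimillion-compound database as outside its domain under a check of this kind.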

Recent studies (2016 onwards) have exploited a wider range of ML methods, including random forests (RF), support vector machines (SVM), Bayesian methods, deep and shallow neural networks, and others, to identify more complex, nonlinear relationships between molecular properties and efficacy. The Ekins group has been prominent in applying ML to anti-TB drug discovery. They have championed the use of Bayesian methods to create predictive classification models. Ekins et al. used data from in vitro testing of a library of 639 compounds to create binary classification ML models for M. tuberculosis topoisomerase I [Citation5]. Their Laplacian-corrected Bayesian classifier model had a fivefold cross-validated receiver operating characteristic (ROC) value of 0.74 and sensitivity, specificity, and concordance values above 0.76, similar to results from SVM and RF models, and was used to select commercially available compounds for testing.
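
The statistics used to judge such binary classifiers can be sketched as follows. The labels and scores are hypothetical, and the sketch assumes that the reported ROC value refers to the area under the ROC curve, which the source does not state explicitly. Sensitivity, specificity, and concordance (overall accuracy) are computed from the confusion matrix, and the ROC area is computed as the fraction of active/inactive pairs that the model ranks correctly.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and concordance for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),            # fraction of actives found
        "specificity": tn / (tn + fp),            # fraction of inactives rejected
        "concordance": (tp + tn) / len(y_true),   # overall fraction correct
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical assay labels and model scores (illustrative only)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # arbitrary 0.5 cutoff

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, f"ROC AUC = {auc:.3f}")
```

Unlike the three threshold-dependent statistics, the ROC area depends only on how the model ranks compounds, which is why it is often reported alongside them.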

The recent availability of larger and more chemically diverse datasets has allowed the training of models with substantially increased domains of applicability, making them more useful for virtual screening. These recent approaches have broken away from the original QSAR paradigm that was focused on understanding the basis for drug action rather than accurate prediction of drug properties [Citation6].

2. Expert opinion

Clearly, the goal of using ML methods is to accelerate discovery and optimization of anti-TB drugs, mirroring successes in other diseases. The main pitfalls in the field are the assumption that relationships between chemical structure and biological activity are linear; the small number and limited chemical diversity of molecules used to generate earlier models; and use of models with small domains of applicability to screen large databases of candidate drugs dissimilar to those used to generate the models. The value of data-driven ML is highly dependent on the diversity, range, and quality of the data used to train models, a major limitation until recently. Now, large datasets on the effects of molecules on TB are being generated, and automation and robotics have facilitated screening of large numbers of molecules in in vitro assays for anti-TB activity. This should also see an increase in models that generate quantitative predictions, potentially displacing the current focus on classification methods that discard useful information about structure–activity relationships.

Given the spectacular progress in applying ML to complex problems in science, technology, and medicine, the potential for ML to accelerate the discovery and development of anti-TB drugs is high. As larger and more diverse data sets become available, the utility of ML models will increase, particularly for virtual screening of large databases of hypothetical drugs. To date, only a relatively small number of ML methods have been applied to anti-TB drug research, so there is scope for improvements in model quality in the future. Thus, the most urgent need to drive this field forward is larger and more chemically diverse datasets. If these can be generated, the ultimate goal of using ML methods to discover new drugs with novel structures and new modes of action for treating TB will be achieved.

In the future, the field will see a substantial increase in the application of ML methods for anti-TB drug discovery, as better data become available, researchers realize that ML models are not hard to implement, and public domain software becomes even more available. Given the vast size of drug-like space (~10^60 molecules), clearly high throughput experiments or ML models alone cannot explore more than a tiny fraction of chemistry space. Other AI methods that use evolutionary methods to explore widely diverse chemistries to find new chemotypes or modes of action will become increasingly common for drug design in general [Citation7] and anti-TB drug discovery specifically.

Declaration of interests

The author has no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Additional information

Funding

This paper was not funded.

References

  • Winkler DA. Use of artificial intelligence and machine learning for discovery of drugs for neglected tropical diseases. Front Chem. 2021;9:614073.
  • Ragno R, Marshall GR, Di Santo R, et al. Antimycobacterial pyrroles: synthesis, anti-mycobacterium tuberculosis activity and QSAR studies. Bioorg Med Chem. 2000 Jun;8(6):1423–1432.
  • Kumar A, Siddiqi MI. Receptor based 3D-QSAR to identify putative binders of mycobacterium tuberculosis enoyl acyl carrier protein reductase. J Mol Model. 2010 May;16(5):877–893.
  • Maganti L, Ghoshal N, Consortium O. 3D-QSAR studies and shape based virtual screening for identification of novel hits to inhibit MbtA in mycobacterium tuberculosis. J Biomol Struct Dyn. 2015 Feb 1;33(2):344–364.
  • Ekins S, Godbole AA, Keri G, et al. Machine learning and docking models for mycobacterium tuberculosis topoisomerase I. Tuberculosis. 2017 Mar;103:52–60.
  • Fujita T, Winkler DA. Understanding the roles of the “two QSARs.” J Chem Inf Model. 2016 Feb 22;56(2):269–274.
  • Le TC, Winkler DA. A bright future for evolutionary methods in drug design. ChemMedChem. 2015 Aug;10(8):1296–1300.
